Preface


This volume is a compilation of the edited proceedings of the “Parallel Computing in CFD” course held at the von Karman
Institute (VKI) in Rhode-Saint-Genèse, Belgium, 15-19 May 1995 and at the NASA Ames Research Center, Moffett Field,
California, USA, 16-20 October 1995.

In  order  to  circumvent  the  limits  posed  by  processor  performance,  today’s  advanced  computer  architectures  permit 
simultaneous  computations  on  multiple  functional  units.  This  approach,  termed  parallel  processing,  has  the  potential  for  a 
dramatic  improvement  in  overall  computational  speed.  This  revolution  in  parallel  processing  is  expected  to  strongly  influence 
the  choice  of  algorithms  used  in  computational  fluid  dynamics  (CFD). 

This series of lectures, supported by the AGARD Fluid Dynamics Panel and the von Karman Institute, presents and discusses
the latest advances and future trends in the application of parallel computing to computationally intensive problems
in CFD. The lectures focus on the increasingly sophisticated types of architectures now available, and on how to
exploit these architectures with appropriate algorithms for the simulation of fluid flow.

Some  of  the  subjects  discussed  are:  parallel  algorithms  for  computing  compressible  and  incompressible  flow;  domain 
decomposition  algorithms  and  partitioning  techniques;  and  parallel  algorithms  for  solving  linear  systems  arising  from  the 
discretized  partial  differential  equations. 

We  want  to  thank  all  the  speakers  for  their  outstanding  work,  as  well  as  the  organizers  at  AGARD,  VKI,  and  NASA  Ames. 


Parallel Computers and Parallel Algorithms for CFD:
An Introduction


Dirk  Roose  and  Rafael  Van  Driessche 

Katholieke  Universiteit  Leuven 
Dept. of Computer Science
Celestijnenlaan 200A
B-3001 Heverlee-Leuven, Belgium
E-mail: Dirk.Roose@cs.kuleuven.ac.be


1  SUMMARY 

This text presents a tutorial on those aspects of parallel computing that are important
for the development of efficient parallel algorithms and software for Computational
Fluid Dynamics.

We first review the main architectural features of parallel computers and briefly
describe some parallel systems on the market today. We introduce some important
concepts concerning the development and the performance evaluation of parallel
algorithms. We discuss how work load imbalance and communication costs on distributed
memory parallel computers can be minimised. We present performance results for some
CFD test cases. We focus on applications using structured and block structured grids,
but the concepts and techniques are also valid for unstructured grids.


2  PARALLEL  COMPUTING 

2.1  Parallel  computer  architectures 

2.1.1  Classification  of  Flynn 

For many years the taxonomy of Flynn has been used for the classification of
high-performance computers. This classification is based on the way instruction and
data streams are handled (Single or Multiple Instruction / Data Streams). This leads
to a classification in three main architectural classes, see e.g. [1].

SISD systems. These are the conventional systems (workstations, compute-servers) that
contain one CPU and hence can execute one instruction stream in serial mode. Nowadays
many large compute-servers or mainframes have more than one CPU, but these are most
often used to execute unrelated jobs (instruction streams). Therefore, such systems
should be regarded as (a couple of) SISD machines.


SIMD systems. Such systems have a large number of (simple) processing units, ranging
from 1,024 up to 64K, that may all execute the same instruction on different data in
lock-step. Thus a single instruction manipulates many data items in parallel. In the
past, SIMD machines such as the Connection Machine CM-2 of Thinking Machines and the
MasPar have been quite successful. Today, the SIMD architecture has nearly
disappeared, except in systems for specific application areas, such as image
processing, that are dominated by highly structured data sets and data access
patterns.

MIMD systems. In ‘Multiple Instruction, Multiple Data’ systems, the processors
independently execute different instruction streams, each on its own data. Hardware
and software are designed so that processors can cooperate efficiently. Parallel
processing occurs when tasks executed on different processors together form one
single job.

Vector processors are often considered a subclass of SIMD systems. Vector processors
contain special hardware (‘vector units’) to perform operations on arrays of similar
data in a pipelined fashion. These vector units can deliver results at a rate of one,
two and, in special cases, three per clock cycle. From the programmer’s point of view,
vector processors operate on their data in an almost parallel way (SIMD-style) when
executing in vector mode. Vector processors are used in the Cray C90, J916 and T90
series, the Convex C-series, the Fujitsu VP-series, the NEC SX-series, etc.

The pipelined execution of floating point operations is also a key concept in RISC
processors, used in high performance workstations. Advanced RISC processors can also
execute several instructions in parallel (e.g. ‘dual instruction mode’).

2.1.2  Memory  organisation  of  MIMD  systems

Parallel computers can also be classified according to other criteria. MIMD systems
are further distinguished according to the organisation of the memory.


Paper presented in an AGARD-FDP-VKI Special Course on "Parallel Computing in CFD", held at the VKI, Rhode-Saint-Genèse, Belgium,
from 15-19 May 1995 and from 16-20 October 1995 at NASA Ames, United States, and published in R-807.
Fig. 1: Shared Memory MIMD parallel computers: possible interconnections.


Shared memory MIMD systems. In shared memory MIMD systems all processors have access
to a common memory. The main architectural problem in shared memory systems is that
of the connection of the processors to the memory (or memory modules). As more
processors are added, the collective bandwidth to the memory ideally should increase
linearly with the number of processors, P. Unfortunately, full interconnection is
very costly, requiring O(P^2) connections. So, various alternative interconnection
networks are used, some of which are shown in Fig. 1. A crossbar uses P^2
interconnections and an omega-network uses P log2 P connections, while a central bus
represents only one connection. In all present-day multi-processor vector processors,
a crossbar is used. Due to the limited capacity or the cost of the interconnection
network, shared memory parallel computers are not scalable to a very high number of
processors.
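The growth rates quoted above can be tabulated with a few lines of Python. This is only an illustrative sketch: the function name `connection_counts` is hypothetical, and it assumes that the number of memory modules equals the number of processors P and that P is a power of two (as required for an omega-network).

```python
def connection_counts(p):
    """Connection counts for three shared-memory interconnects.

    Assumption (not from the text): p processors are connected to
    p memory modules, and p is a power of two.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    log2_p = p.bit_length() - 1          # exact integer log2 for powers of two
    return {
        "crossbar": p * p,               # P^2 interconnections
        "omega": p * log2_p,             # P log2 P connections
        "bus": 1,                        # a single shared connection
    }


if __name__ == "__main__":
    for p in (4, 16, 64):
        print(p, connection_counts(p))
```

The table makes the trade-off visible: the crossbar cost grows quadratically, while the bus stays at one connection that all processors must share.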

The shared memory concept has been used for about 10 years in multiprocessor vector
machines (Cray X-MP, IBM 3090, their successors and similar systems from other
vendors). However, these systems have not often been used as truly parallel systems;
most jobs use only one processor. One reason for this is the limited number of
processors (often 4 or 8), which limits the possible ‘speedup’ of the execution.
Moreover, because of time-sharing, the user normally has no full control over the
number of processors allocated to a job at a particular moment. This may also limit
the speedup that can be achieved.

Nowadays, a number of vendors (Convex, Silicon Graphics, ...) offer shared memory
MIMD systems based on RISC processors, with up to about 20 processors.


Distributed memory MIMD systems. A distributed memory MIMD parallel computer consists
of a number of processors, each with its own local memory, interconnected by a
communication network. The combination of a processor and its local memory is often
called a processing node. Each processing node is in fact a complete computer,
operating rather independently from the other nodes. Processing nodes can only
communicate by passing messages over the communication network.

For distributed memory machines too, the structure of the communication network is of
crucial importance. Ideally, one would like to have a completely connected system
where each processing node is directly connected to every other node. However, this
is not feasible for a large number of nodes. Therefore the processing nodes are
arranged in some interconnection topology. The richness of the connection structure
has to be balanced against the costs.

The hypercube topology has been used in several systems in the past. A nice feature
is that for a hypercube with P = 2^d processing nodes the ‘diameter’ of the network
(i.e. the maximum number of links between any two nodes) is d. So, the diameter grows
only logarithmically with the number of nodes. In addition, it is possible to
simulate on a hypercube many other topologies, such as trees, rings, 2-D and 3-D
meshes, since these topologies are subsets of the hypercube topology.
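A small sketch may help to make the diameter and the ring embedding concrete. It assumes the usual labelling of hypercube nodes by d-bit integers, with a link between two nodes whose labels differ in exactly one bit; this labelling, the function names and the use of a binary reflected Gray code for the ring are illustrative choices, not taken from the text.

```python
def neighbours(node, d):
    """Nodes directly linked to `node` in a d-dimensional hypercube."""
    return [node ^ (1 << k) for k in range(d)]


def hops(a, b):
    """Minimum number of links between nodes a and b (Hamming distance)."""
    return bin(a ^ b).count("1")


def diameter(d):
    """Largest hop count between any two of the 2**d nodes.

    By symmetry it is enough to measure distances from node 0.
    """
    return max(hops(0, b) for b in range(1 << d))


def gray_ring(d):
    """Order the 2**d nodes so that consecutive nodes are hypercube
    neighbours (binary reflected Gray code); this embeds a ring."""
    return [i ^ (i >> 1) for i in range(1 << d)]


if __name__ == "__main__":
    d = 4
    print("P =", 1 << d, "diameter =", diameter(d))
    print("ring order:", gray_ring(d))
```

With this ordering, each node of the ring talks only to direct hypercube neighbours, including the wrap-around link from the last node back to the first.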

In the current parallel systems, the network topology and the communication diameter
are of less importance, because these systems employ some form of ‘wormhole routing’
of messages. This means that as soon as a communication path between two nodes is
established, the data is sent through this path without disturbing the operation of
the intermediate nodes. Except for a small amount of time in setting up the
communication path between nodes, the communication time has become virtually
independent of the distance between the nodes.

Some systems use a 2-D or 3-D mesh structure for the network. The rationale for this
is that this interconnection topology is sufficient for most algorithms used in
large-scale scientific computing and that a richer interconnection structure hardly
pays off. In other systems a multi-stage network is used, e.g. an omega-network as
shown in Figs. 1 and 4. Multi-stage networks have the advantage that the ‘bisection
bandwidth’ can scale linearly with the number of processors, while maintaining a
fixed number of communication links per processor. The bisection bandwidth of a
distributed memory system is defined as the bandwidth available on all communication
links that connect one half of the system (P/2 processors) with the second half.
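The definition can be illustrated by counting the links that cross a cut between two halves of a network. The sketch below makes several assumptions that are not in the text: every link has the same bandwidth (so counting links is a proxy for bandwidth), the mesh has an even side length and no wrap-around links, and the halves are simply the lower and upper half of the node numbers. For these two regular topologies that particular cut is expected to be a narrowest one, but the code does not search over all possible cuts.

```python
def mesh_links(n):
    """Links of an n x n 2-D mesh without wrap-around, nodes numbered row by row."""
    links = []
    for r in range(n):
        for c in range(n):
            node = r * n + c
            if c + 1 < n:
                links.append((node, node + 1))   # link to the right neighbour
            if r + 1 < n:
                links.append((node, node + n))   # link to the neighbour below
    return links


def hypercube_links(d):
    """Links of a d-dimensional hypercube, each link listed once."""
    return [(a, a ^ (1 << k))
            for a in range(1 << d)
            for k in range(d)
            if a < (a ^ (1 << k))]


def crossing_links(links, p):
    """Number of links joining nodes 0..p/2-1 to nodes p/2..p-1."""
    half = p // 2
    return sum((a < half) != (b < half) for a, b in links)


if __name__ == "__main__":
    print("8 x 8 mesh :", crossing_links(mesh_links(8), 64))
    print("6-D cube   :", crossing_links(hypercube_links(6), 64))
```

For 64 nodes the mesh cut contains 8 links and the hypercube cut 32, which shows how differently the bisection capacity of two topologies can grow with the number of nodes.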

An important advantage of distributed memory systems is that this architecture
suffers less from the scalability problem. The network (with its limited bandwidth)
has to be used only when processing nodes communicate, not for every memory access. A
disadvantage is that the communication overhead is (much) higher than the overhead
caused by using shared data in a shared memory system. When the structure of a
problem dictates a frequent exchange of data between processors, it may well be that
only a very small fraction of the theoretical peak performance can be achieved.

The first generation DM-MIMD systems were based on simple, inexpensive
microprocessors. Thus even when 100 to 1000 processors were interconnected, the peak
performance of these machines was lower than that of typical vector processors and
shared memory parallel supercomputers. Today, distributed memory parallel computers
often outperform more traditional supercomputers. This is due to the fast-growing
performance of the RISC processors used in distributed memory systems and to the
greatly improved network technology. Moreover, many systems now have sophisticated
hardware and software that allow fast parallel I/O to disk storage (i.e. a
‘Concurrent File System’). As a result, distributed memory parallel systems are
rapidly gaining importance in fields where computational performance is important,
such as Computational Fluid Dynamics.

Examples of distributed memory machines are the Intel Paragon, the CM-5 of Thinking
Machines, the Cray T3D and the IBM SP2. Distributed memory systems with more than a
thousand processors exist, but most systems have 16 to 128 processors.

For a few years now, networks or clusters of workstations have been used as ‘low
cost’ distributed memory parallel computers. A workstation cluster makes it possible
to exploit otherwise unused computing capacity. Of course, if workstations are simply
connected together via Ethernet (or even via a fast FDDI interconnection), the number
of workstations that can be used effectively together as a parallel system is
limited, because of the limited communication performance. Some workstation vendors
(e.g. Digital) offer interconnection switches to provide fast communication. Such
clusters are bridging the gap with ‘truly parallel computers’.

Thus  a  whole  range  of  systems  are  used  nowadays  as 
parallel  computers,  ranging  from  small  workstation 
clusters  to  large  systems  with  many  processors  and 
sophisticated  communication  network  technology. 

Hybrid memory organisations. Although the difference between shared and distributed
memory systems seems clear cut, many parallel systems have a hybrid memory
organisation. In a shared memory system, every processor may have a large cache,
which can be considered as a local memory. Some systems have a two-level
organisation: processors are grouped together in shared memory modules, which are
interconnected via a communication network. Finally, a distributed memory system may
contain hardware and software support to access data in other processors’ memories in
a way that is transparent to the user. Depending on the precise form of this support,
this is called ‘virtual shared memory’, ‘global shared memory’, ‘global virtual
memory’, etc.

Memory hierarchy and performance. Both vector processors and RISC processors can
perform floating point operations much faster than data can be read from and written
to main memory. Vector registers (in vector processors) or cache memories (in RISC
processors) are placed between the processor and the main memory. These very fast
memory modules should keep the processors busy with computation without having to
frequently reference main memory.

Vector registers, caches, local memories and/or the global (shared) memory together
form a memory hierarchy. The performance that can be achieved for a given application
program critically depends on the (re-)use of data stored in the ‘higher levels’ of
the memory hierarchy. Thus in order to achieve a high performance, the algorithms
should exhibit locality of data access, both in (address) space and time.

2.2  Programming  parallel  computers 

We have indicated that there is no clear cut distinction between shared memory and
distributed memory parallel architectures, and that some recent parallel systems have
a hybrid organisation. However, we can clearly distinguish between two different
programming models, the shared memory and the distributed memory programming model.

In both models, the execution of a program is split into several processes that are
executed in parallel. In most cases, only one process is executed on each processor
and therefore we will use the term ‘processor’ in the discussion below, although
‘process’ would often be more precise.
2.2.1  Shared  memory  programming  model

Regardless of the physical organisation of the memory, the shared memory programming
model is based on the existence of a common or global address space, i.e. every
processor can address every memory location. Thus processors communicate by accessing
(writing, reading) shared data. The time to access the shared data may differ
considerably, depending on the physical location of the shared data (cache, local
memory, another processor’s memory).

Also, situations are possible where different processors wish to use a part of the
common memory simultaneously. In that case synchronisation of the processes is
necessary. Synchronisation is also needed before a sequential part of a program, in
order to ensure that all processors have finished their parallel actions prior to
this sequential part.

Thus the performance may deteriorate substantially due to ‘memory conflicts’,
synchronisation and ‘sequential bottlenecks’. Further, the overhead associated with
the creation of (parallel) tasks can be very high.

At present, Fortran and C compilers exist that perform automatic parallelisation in
the shared memory programming model. The programmer can influence the parallelisation
via directives (as for vector processing). Parallelisation can be carried out at loop
level or at task (or routine) level, also called ‘fine grained’ and ‘coarse grained’
parallelism respectively, or ‘micro-tasking’ and ‘macro-tasking’.

2.2.2  Distributed  memory  programming  model

In the distributed memory programming or message passing model, processors can only
access their own private memory. Whenever a processor needs data that reside in the
memory of another processing node, the data must be sent between the processing
nodes. Such a message passing or communication step involves preparation of the
message in the sending node, transmission over the communication network and
reception of the message in the destination node. When the message passing model is
used on a shared memory system, the actual transmission is replaced by storage of the
information in shared memory.

In a distributed memory model too, synchronisation problems can occur. It is possible
that a processor does not yet have the data available at the moment they are needed
by another processor; at this synchronisation point the processor that needs the data
has to wait for the other processor to catch up. Synchronisation may also be needed
to ensure that the communication between processors proceeds in a correct way.

Although each processor can execute a different program, most often the ‘Single
Program, Multiple Data’ (SPMD) programming style is used: all processors execute the
same program acting on different parts of the data set. This requires an appropriate
partitioning (distribution) of the data and of the operations that have to be
performed on them. The partitioning of the data must be such that the work load is
well balanced between processors and that communication and synchronisation are
minimised.
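As an illustration of a balanced partitioning in SPMD style, the sketch below distributes n data items over P processors in contiguous blocks whose sizes differ by at most one. The block distribution, the function name `block_range` and the example sizes are assumptions chosen for illustration; the text does not prescribe a particular partitioning.

```python
def block_range(n, p, rank):
    """Half-open index range [lo, hi) of the n items owned by processor `rank`.

    The first (n mod p) processors receive one extra item, so block
    sizes differ by at most one.
    """
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def local_sum(data, p, rank):
    """SPMD-style kernel: the same code runs for every rank, each on its own block."""
    lo, hi = block_range(len(data), p, rank)
    return sum(data[lo:hi])


if __name__ == "__main__":
    data = list(range(10))
    p = 4
    for rank in range(p):
        print(rank, block_range(len(data), p, rank), local_sum(data, p, rank))
```

Every rank calls the same functions with a different `rank` argument, which is the essence of the SPMD style; combining the partial sums would require a communication step.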

Programming in the distributed memory model is often more difficult than programming
in the shared memory model. The programmer must be aware of the location of the data
in the local memories and has to move or distribute these data explicitly when
needed. The partitioning of the data and all necessary communication have to be
included explicitly in the program. A sequential program often needs significant
changes in order to parallelise it.

Distributed memory programs are written in conventional languages (Fortran, C, C++,
...) and a communication library is used to implement the communication and
synchronisation operations. Basic communication routines allow messages to be sent
and received between arbitrary processing nodes. Incoming messages are normally
buffered by the operating system at the destination node until the application
program requests the message. Various ‘higher level’ routines are also provided, e.g.
for ‘global operations’ on a set of data distributed across the nodes (broadcast,
global sum, global maximum, ...) and for synchronisation.
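To give an idea of what a ‘global sum’ involves, the following sketch simulates one possible scheme, often called recursive doubling, in which each of P = 2^d nodes exchanges its partial sum with a partner in each of d steps. It is a sequential simulation written for illustration only; it does not use or reproduce the routines of any particular communication library, and real libraries may use different algorithms.

```python
def global_sum(local_values):
    """Simulate a global sum over P = 2**d nodes by recursive doubling.

    local_values[i] is the value held by node i. After log2(P) exchange
    steps every node holds the sum of all values.
    """
    p = len(local_values)
    if p < 1 or p & (p - 1):
        raise ValueError("number of nodes must be a power of two")
    values = list(local_values)
    step = 1
    while step < p:
        # Each node 'receives' the partial sum of its partner (node XOR step) ...
        received = [values[node ^ step] for node in range(p)]
        # ... and adds it to its own partial sum.
        values = [own + other for own, other in zip(values, received)]
        step <<= 1
    return values


if __name__ == "__main__":
    print(global_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

In this scheme the partners in each step differ in exactly one bit of their node number, so on a hypercube every exchange takes place between directly connected nodes.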

In addition to the machine dependent communication libraries, several machine
independent libraries have been developed. Widely used libraries are PVM [2], MPI [3]
and PARMACS [4]. Some of these libraries or environments (e.g. PARMACS) contain
utility routines that perform automatic partitioning and mapping of vectors and
matrices, and facilities for performance monitoring and analysis. The PVM environment
provides facilities to use a (heterogeneous) network of workstations as a distributed
memory parallel computer.

Compilers and software tools that perform (semi-)automatic parallelisation for
DM-MIMD machines are becoming available now. High Performance Fortran is a set of
extensions to Fortran 90 for writing parallel applications [5]. HPF includes features
for mapping multi-dimensional arrays (i.e. structured data sets) to parallel
processors and for specifying data parallel operations. Extensions to HPF are being
developed that offer a similar functionality for more complex data structures, e.g.
multi-block grids [6]. FORGE 90 is a software tool for the analysis and the
(semi-)automatic parallelisation of existing sequential codes: based on a user
defined partitioning of the data arrays, FORGE allows interactive or automatic
selection of do-loops to be parallelised [7].


2.3  Description  of  some  parallel  systems

Intel Paragon. The Intel Paragon is a distributed memory system in which the
processing nodes are interconnected in a 2D mesh network, see Fig. 2. A Paragon
system with 1874 processors is operational at Sandia National Laboratories. Two types
of processing nodes are available, both based on the Intel i860 processor. General
Purpose nodes contain 2 processors (one for calculation and one for communication)
and an I/O expansion port. Multiprocessor nodes contain two processors for
calculation (with shared memory) and one for communication. Wormhole message passing
through the network is carried out by Mesh Router Chips, one for each node. The
processing nodes are logically divided into a compute partition (for parallel program
execution), an I/O partition (nodes designated to disk I/O and networking) and a
service partition (interactive use, compilation).

The distributed operating system provides a single system image (single process ID
space, single file system, etc.) and automatic scheduling of jobs. The distributed
memory programming model is supported via Intel’s NX communication system or via the
SUNMOS environment (Sandia). The Parallel Development Environment contains various
tools for software development and performance monitoring. For more information, see
e.g. [8].

Cray T3D. The Cray T3D is a distributed memory parallel system with 32 to 2048
processing nodes. The processing nodes (DEC Alpha processors) are connected by a
bidirectional 3D torus (periodic mesh) network (each switch of the network is shared
by two nodes), see Fig. 3. Various mechanisms are implemented to reduce the
communication cost over the interconnection network and to synchronise processing
nodes. The memory is physically distributed, but is globally addressable. Hence,
three programming models are supported: SIMD (data parallel), shared memory MIMD and
distributed memory MIMD programming styles. The software environment includes a
Fortran compiler with Fortran 90 features (array syntax, etc.) which allows the user
to mix all three programming models in one program. Also included are PVM, a
performance analyser, etc. The T3D system needs a Cray vector processor as host
system.

IBM SP2. The IBM SP2 is a distributed memory system with up to 128 processing nodes.
Two types of processing nodes are available, ‘thin nodes’ and ‘wide nodes’, both
based on the POWER2 processor. Wide nodes allow larger memories, provide a faster
processor-to-memory connection and allow various storage devices to be attached. The
nodes are interconnected by a ‘High-Performance Switch’, see Fig. 4. The switch is a
multi-stage omega network that performs wormhole routing. The available communication
bandwidth over the switch scales linearly with the number of processors. Support for
short messages with low latency and minimal message overhead is provided. For more
information, see e.g. [9]. The AIX Parallel Environment contains a Message Passing
Library (MPL), performance monitoring and visualisation tools. An optimised version
of PVM is also available. Only the distributed memory programming model is supported.
Job scheduling support is provided by the ‘LoadLeveler’ software.

Silicon Graphics Power Challenge. The Silicon Graphics Power Challenge systems are
shared memory multiprocessors with up to 18 processors (MIPS R8000), each having
multiple functional units that can operate simultaneously. Each processor has a cache
hierarchy with a small, fast on-chip cache and a large, slower but pipelined off-chip
cache. The main memory can be up to 8-way interleaved.

The Fortran and C compilers are able to restructure programs to reduce cache misses
by interchanging loops, by ‘tiling’ or ‘blocking’ in case of nested loops, etc. (Loop
blocking is a technique for optimising the performance of the memory hierarchy, in
case of e.g. operations on matrices.) Further, the compilers support automatic and
user-directed (via directives) parallelisation of Fortran and C programs. For more
information, see e.g. [10].
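The loop structure that results from blocking can be sketched for a matrix product. The example below is written by hand for illustration and is not the output of any compiler; the function names and the block size are arbitrary. In an interpreted language the cache benefit itself will hardly be measurable, so the sketch only shows how the three loops are split into loops over blocks and loops within a block.

```python
def matmul_naive(a, b, n):
    """Reference product c = a * b for n x n matrices stored as lists of rows."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, block):
    """Same product, computed block by block so that small sub-matrices
    of a, b and c are reused while they are likely to stay in cache."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):                # loops over blocks
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):      # loops within a block
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


if __name__ == "__main__":
    n = 5
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i * j + 1) for j in range(n)] for i in range(n)]
    print(matmul_blocked(a, b, n, 2) == matmul_naive(a, b, n))
```

The `min(...)` bounds handle matrix sizes that are not a multiple of the block size.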

Up to eight Power Challenge systems can be interconnected by a (switch-based)
communication network, forming a ‘CHALLENGEarray’ system. Communication over this
network must be programmed in distributed memory style using message-passing
libraries (PVM, MPI) or using High-Performance Fortran.

Convex Exemplar. The Convex Exemplar consists of a number of hypernodes, connected to
each other via a low latency ring network with four interleaved links. Each hypernode
is a shared-memory multiprocessor, consisting of 8 processors (HP PA-RISC 7200) that
are connected to 4 memory modules by a crossbar, see Fig. 5.

The Exemplar programming environment provides both shared memory and distributed
memory programming support. For message passing, the PVM communication library is
used. The shared memory programming environment is implemented through what is called
‘Global Shared Distributed Virtual Memory’. An application, programmed in shared
memory style, can use processors located on various hypernodes. In that case, three
levels in the memory hierarchy are used: the large cache of a particular processor,
the global memory of the hypernode to which the processor belongs and memories
located on different hypernodes.

The time needed to access data located on a different hypernode is
longer than the time needed to access data within a hypernode. In
order to reduce the delay caused by using the ring interconnect,
each hypernode contains a cache of memory references made over the
interconnect. The information held in this cache can be used to
locate any global data that is currently encached in the hypernode.
The system automatically maintains cache coherence between multiple
hypernodes.

2.4  Parallel  performance  parameters 

The  quality  of  a  parallel  implementation  is  often 
measured  by  the  achieved  speedup  or  efficiency. 

The parallel speedup achieved by a parallel algorithm running on P
processors is defined as the ratio of the execution time of the
parallel algorithm on a single processor and the execution time of
the parallel algorithm on P processors. The parallel efficiency is
equal to the speedup divided by P. We have thus the following
definitions for the parallel speedup $S(n,P)$ and the parallel
efficiency $E(n,P)$:

$$ S(n,P) = \frac{T(n,1)}{T(n,P)}, \qquad E(n,P) = \frac{S(n,P)}{P} = \frac{T(n,1)}{P\,T(n,P)}, \tag{1} $$

where n denotes the problem size, and $T(n,1)$ and $T(n,P)$ denote
the execution times of the algorithm on one and P processors
respectively.

Fig. 2: The architecture of the Intel Paragon, based on a
two-dimensional interconnection network.

Note that (1) does not give any information about the quality of
the parallel algorithm. It solely measures how well an algorithm
has been parallelised. As such, it should always be complemented
with data which indicate the numerical efficiency of the parallel
algorithm, which can be defined as the ratio of the following
single-processor execution times: $T_{best}(n)/T(n,1)$, where
$T_{best}(n)$ denotes the time taken by one processor of the
parallel computer executing the fastest known sequential algorithm.
Combination of the definitions of parallel speedup (or efficiency)
and numerical efficiency leads to the notion of total speedup and
total efficiency, defined by

$$ S(n,P) = \frac{T_{best}(n)}{T(n,P)}, \qquad E(n,P) = \frac{S(n,P)}{P} = \frac{T_{best}(n)}{P\,T(n,P)}. \tag{2} $$


Practical  considerations  limit  the  usefulness  of  the 
latter  definitions.  First  of  all,  it  is  often  very  difficult 
to  determine  what  algorithm  is  the  best  sequential 
one;  this  may  depend  on  the  problem  size  n,  on  the 
particular  hardware  used,  on  implementation  issues, 
etc.  Moreover,  the  notion  of  ‘best’  algorithm  may 
change  in  time,  as  better  algorithms  become  avail¬ 
able.  Also,  a  good  implementation  of  that  algorithm 
is  not  always  available.  In  practice  one  can  define 
the  total  speedup  by  using  the  execution  time  of  a 
good sequential algorithm instead of $T_{best}(n)$.


Fig. 3: The architecture of the Cray T3D (three-dimensional
interconnection network).

Fig. 4: Left: A 16-node bidirectional multi-stage network, forming
the basic building block (‘frame’) for the High-Performance Switch
in the IBM SP2. Right: Twelve frames are used to interconnect 128
processors.

Fig. 5: The architecture of the Convex Exemplar: logical system
view (left) and physical system view (right).

If we assume that a P-processor machine cannot execute more than P
times faster than a single-processor machine, we obviously have
that $S(n,P) \le P$ and $E(n,P) \le 100\%$. We now enumerate some
overheads that may cause a deviation from linear speedup.

• the sequential fraction. The speedup achievable on a parallel
computer can be significantly limited by the existence of a small
fraction of inherently sequential code which cannot be
parallelised. This is expressed by Amdahl’s law, see e.g. [11]:

Let $\alpha$ be the fraction of operations in a computation that
must be performed sequentially, where $0 \le \alpha \le 1$. The
maximum speedup achievable by a parallel computer with P processors
is then limited as follows,

$$ S(n,P) \le \frac{1}{\alpha + (1-\alpha)/P}. \tag{3} $$

For example, when 10% of the code must be executed sequentially,
the maximum speedup is limited by 10, independent of the number of
processors available.

Amdahl’s law has been a central argument of people doubting the
usefulness of massively parallel systems. Their criticism is
justified as long as one considers solving a particular problem of
a fixed size (i.e., with a constant value of $\alpha$). In actual
practice, however, this is rarely the case, as problem sizes tend
to scale with the number of processors and with the computing power
available. (Large-scale parallel systems are used to solve bigger
problems than the ones solved on small-scale parallel systems.)

For many computational problems the sequential fraction $\alpha$
rapidly goes to zero as the problem size increases. Consequently,
when problem scaling is in effect, $\alpha$ depends on the number
of processors, and (3) loses much of its significance. An
alternative to Amdahl’s law was formulated by Gustafson et al.
[12] [13].

Let $\alpha$ denote the sequential fraction of the time spent
during a computation on a parallel system with P processors. The
maximum speedup achievable is then limited as follows,

$$ S'(n,P) \le P(1-\alpha) + \alpha. \tag{4} $$

$S'(n,P)$ is usually called the scaled speedup. It is equal to the
ratio $T'(n,1)$ over $T'(n,P)$, where $T'(n,1)$ is the time the
parallel program would take to run on a single processor if
sufficient resources (memory) were available.

In large-scale applications, the sequential fraction is often a
small number, and very high scaled speedups are attainable on
large-scale parallel processors. Fig. 6 shows the dependence of
$S(n,P)$ and $S'(n,P)$ on the respective serial fraction.

•  non-optimal  algorithms  and  algorithmic 
overhead.  The  best  sequential  algorithm  may  of¬ 
ten  be  difficult  or  impossible  to  parallelise  (e.g., 
Thomas  algorithm  for  solving  tridiagonal  linear  sys¬ 
tems).  In  that  case  the  parallel  algorithm  may  have  a 
larger  operation  count  than  the  sequential  one.  Ad¬ 
ditionally,  in  order  to  avoid  communication  overhead 
one  may  wish  to  duplicate  some  calculations  on  dif¬ 
ferent  processors,  rather  than  having  one  processor 
doing  the  calculation  and  then  distributing  the  result 
(e.g.  ‘double  flux  calculations’,  see  further). 

•  software  overhead.  Parallelisation  often  results 
in  an  increase  of  the  (relative)  software  overheads 
such  as  the  overheads  associated  with  indexing,  pro¬ 
cedure  calls,  etc.  Also,  this  approach  usually  results 
in  shorter  loops,  thus  restricting  vector  lengths.  This 
reduces  the  potential  gain  of  using  vectorisation. 

• load imbalance. The execution time of a parallel algorithm is
determined by the execution time of the processor having the
largest amount of work. As soon as the computational workload is
not evenly distributed, load imbalance will result, and processor
idling will occur: processors must wait for other processors to
finish a particular computation.


•  communication  and  synchronisation  over¬ 
head.  Finally,  any  time  spent  in  communication 
and  synchronisation  is  pure  overhead. 

In  the  next  section,  we  will  discuss  in  detail  these 
various  sources  of  overhead. 
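
As a small numerical illustration of the bounds (3) and (4), the sketch below evaluates both for the processor counts used in Fig. 6. The function names and the sample fractions are our own choices for illustration and are not taken from the original text.

```python
def amdahl_speedup(alpha, p):
    """Bound (3): alpha is the sequential fraction of the operations."""
    return 1.0 / (alpha + (1.0 - alpha) / p)


def scaled_speedup(alpha, p):
    """Bound (4): alpha is the sequential fraction of the time spent
    on the P-processor system."""
    return p * (1.0 - alpha) + alpha


# Sample fractions are illustrative assumptions, not measured data.
for p in (128, 1024):
    for alpha in (0.0, 0.01, 0.1):
        print(p, alpha,
              round(amdahl_speedup(alpha, p), 2),
              round(scaled_speedup(alpha, p), 2))
```

With a sequential fraction of 10%, the first bound stays below 10 for any P, whereas the scaled speedup keeps growing almost linearly with P.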

3 PARALLELISATION OF GRID-ORIENTED PROBLEMS

3.1  Introduction 

In the remainder of this text, we will focus on distributed memory
parallelism for two reasons. Firstly, parallel systems with only
distributed memory have an ‘extreme’ parallel architecture.
Secondly, in the distributed memory programming model, the
parallelism must be introduced explicitly in the application
program. Algorithms designed for
distributed  memory  systems  will  also  perform  well  on 
shared  memory  (or  hybrid)  systems.  Data  partition¬ 
ing,  which  is  necessary  for  distributed  memory  sys¬ 
tems,  is  also  beneficial  for  shared  memory  systems. 
For  example,  entire  matrices  typically  do  not  fit  into 
the  cache.  The  performance  of  the  memory  hierarchy 
can  be  optimised,  by  decomposing  the  matrix  oper¬ 
ations  into  submatrix  operations,  with  a  submatrix 
size  chosen  so  that  the  operands  fit  in  the  cache. 
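
The decomposition into submatrix operations mentioned above can be sketched as a blocked matrix product. This is a minimal illustration in plain Python: the function name and the default block size are assumptions made here, and an interpreted language will not show the cache benefit, so only the loop structure is of interest.

```python
def matmul_blocked(a, b, block=2):
    """Compute c = a * b by looping over block x block submatrices.

    a is n x m, b is m x p (lists of rows). In a compiled code, `block`
    would be chosen so that three submatrices fit in the cache.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):              # block row of c
        for k0 in range(0, m, block):          # block column of a
            for j0 in range(0, p, block):      # block column of c
                # Submatrix product: only three small blocks are touched.
                for i in range(i0, min(i0 + block, n)):
                    row_c = c[i]
                    for k in range(k0, min(k0 + block, m)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(j0, min(j0 + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

The result is identical to that of the usual triple loop; only the order in which the operands are visited changes.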

The  basic  issues  of  parallel  algorithm  design  are 
nowadays  well  understood  and  are  described  in  var¬ 
ious  books  and  papers.  The  book  of  G.  Fox  et 
al.  [14]  is  a  key  reference  (although  somewhat  out¬ 
dated).  The  textbooks  of  E.  Van  de  Velde  [15] 
and  C.  de  Moura  [16]  also  provide  a  good  intro¬ 
duction.  The  proceedings  of  the  yearly  confer¬ 
ences  on  Parallel  Computational  Fluid  Dynamics 
give  an  overview  of  research  and  achievements  in  this 
field  [17,  18, 19].  Also  the  proceedings  of  the  Scalable 
High  Performance  Computing  Conferences  [20,  21], 
the  SIAM  Conferences  on  Parallel  Processing  for  Sci¬ 
entific  Computing  [22,  23]  and  the  HPCN  conferences 
[24,  25]  are  valuable  sources  of  information. 
Fig. 6: Dependence of the speedup on the sequential fraction for
P = 128 and P = 1024. Left: parallel speedup $S(n,P)$; right:
scaled speedup $S'(n,P)$.


For  grid-oriented  problems,  such  as  the  numerical  so¬ 
lution  of  partial  differential  equations,  the  data  are 
defined  on  a  discrete  grid  of  grid  points  or  finite  vol¬ 
umes  or  finite  elements.  In  this  paper,  we  will  use 
the  term  (grid)  point  as  a  generic  name  for  a  grid 
point,  finite  volume  or  element  and  its  data. 

Parallelisation of grid-oriented applications is greatly
facilitated because the calculations on a grid point typically
involve only grid points that are geometrically adjacent.
Parallelisation is achieved by
partitioning  the  grid  into  subdomains  (subgrids)  and 
assigning  these  subdomains  to  the  processors  of  the 
parallel  system.  Each  processor  performs  the  calcu¬ 
lations  associated  with  the  subdomain(s)  assigned  to 
that  processor.  Dependency  (and  communication) 
between  subdomains  is  restricted  to  the  perimeters 
of  the  subdomains. 


Many  important  issues  concerning  parallelisation  of 
grid-oriented  problems  and  performance  analysis  of 
parallel  algorithms  can  be  understood  by  studying 
the  parallel  execution  of  a  ‘model  problem’,  repre¬ 
senting  the  explicit  time-integration  of  a  finite  differ¬ 
ence  or  finite  volume  discretisation  of  a  partial  dif¬ 
ferential  equation  on  a  structured  grid. 


Assume that a 2D structured grid is partitioned in subdomains of
equal size, such that each processor deals with $n_x \times n_y$
grid points or cells. Assume further that the explicit
time-integration is based on a five-point stencil, i.e.

$$ u_{i,j}^{(k+1)} = f\bigl(u_{i,j}^{(k)},\, u_{i-1,j}^{(k)},\, u_{i+1,j}^{(k)},\, u_{i,j-1}^{(k)},\, u_{i,j+1}^{(k)}\bigr). $$


Due to the local nature of the calculations, each processor can
perform the updates for all interior grid points (the white area in
Fig. 7) using only its own data. The other grid points of the
subdomain are called (subdomain) ‘boundary grid points’. In order
to perform the updates of the subdomain boundary grid points, the
processor must also know function values corresponding to grid
points lying at the other side of the subdomain boundaries. This
information must be received from the neighbouring processors and
can be stored in the overlap regions indicated in Fig. 7. On the
other hand, the boundary grid points must be sent to neighbouring
processing nodes, where they are part of the overlap region. Hence,
before each integration step, neighbouring processors exchange
information with each other, see Fig. 8.
Fig. 7: Grid partitioning and communication requirements.


Fig.  8:  Concurrent  exchange  of  local  boundaries. 
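
The exchange of overlap regions described above can be mimicked in a sequential sketch. Below, a grid is partitioned stripwise by rows, the strips exchange their boundary rows, and each strip is then updated using only its own rows and its overlap rows. Everything here is an illustrative assumption: the update function `stencil` is a plain five-point average, the outer boundary is kept fixed, and the ‘processors’ are simply visited one after another.

```python
def stencil(c, n, s, w, e):
    """Hypothetical update function f: average of the five values."""
    return 0.2 * (c + n + s + w + e)


def step_global(u):
    """One explicit step on the unpartitioned grid (reference result)."""
    new = [row[:] for row in u]
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            new[i][j] = stencil(u[i][j], u[i - 1][j], u[i + 1][j],
                                u[i][j - 1], u[i][j + 1])
    return new


def partition(u, p):
    """Stripwise partitioning by rows (assumes p <= number of rows).
    'up' and 'down' hold the overlap rows received from the neighbours."""
    ny = len(u)
    cuts = [ny * r // p for r in range(p + 1)]
    return [{"first": cuts[r],
             "own": [row[:] for row in u[cuts[r]:cuts[r + 1]]],
             "up": None, "down": None} for r in range(p)]


def exchange(strips):
    """Copy each strip's boundary rows into the neighbours' overlap rows."""
    for r in range(len(strips) - 1):
        strips[r]["down"] = strips[r + 1]["own"][0][:]
        strips[r + 1]["up"] = strips[r]["own"][-1][:]


def step_local(strip, ny):
    """Update one strip from its own rows and its overlap rows only."""
    own = strip["own"]
    new = [row[:] for row in own]
    for k, row in enumerate(own):
        g = strip["first"] + k              # global row index
        if g == 0 or g == ny - 1:
            continue                        # physical boundary stays fixed
        above = own[k - 1] if k > 0 else strip["up"]
        below = own[k + 1] if k < len(own) - 1 else strip["down"]
        for j in range(1, len(row) - 1):
            new[k][j] = stencil(row[j], above[j], below[j],
                                row[j - 1], row[j + 1])
    strip["own"] = new


def step_partitioned(strips, ny):
    exchange(strips)                        # communication phase
    for strip in strips:                    # concurrent on a real machine
        step_local(strip, ny)


def gather(strips):
    return [row for strip in strips for row in strip["own"]]
```

Starting from the same initial data, repeated calls of `step_partitioned` followed by `gather` reproduce the result of `step_global`, which is the sense in which the numerical result of an explicit scheme is not affected by the partitioning.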


3.2 Analysis of the communication overhead

For the simple model problem described above, the execution time of
the algorithm for a problem size denoted by n, on a parallel system
with P processors can be written as

$$ T(n,P) = T_{calc} + T_{comm}, $$

where $T_{calc}$ denotes the calculation time and $T_{comm}$
denotes the time spent in communication. Assuming that no other
overhead occurs except communication of the overlap regions, the
sequential execution time is

$$ T(n,1) = P \cdot T_{calc}. $$

Hence the speedup and parallel efficiency are given by

$$ S(n,P) = \frac{P\,T_{calc}}{T_{calc} + T_{comm}} = \frac{P}{1 + f_c}, \qquad E(n,P) = \frac{1}{1 + f_c}, $$

where $f_c = T_{comm}/T_{calc}$ denotes the communication overhead.
If $f_c$ is small, a nearly optimal speedup ($S(n,P) \approx P$)
and efficiency ($E(n,P) \approx 1$) are obtained.

The amount of data sent and received per processor is proportional
to the number of boundary cells, while the amount of computations
performed by each processor is proportional to the number of
interior cells. For the model problem we have

$$ T_{calc} = c_1\, n_x n_y\, t_{calc}, \qquad T_{comm} = c_2 \cdot 2(n_x + n_y)\, t_{comm}, $$

where $t_{calc}$ represents the time required to perform a floating
point operation, $t_{comm}$ denotes the time needed to communicate
one floating point number, and $c_1$, $c_2$ are constants. This
leads to the important formula

$$ f_c = \frac{c_2 \cdot 2(n_x + n_y)\, t_{comm}}{c_1\, n_x n_y\, t_{calc}}, \tag{5} $$

which indicates that the overhead depends on 3 factors:

1. the size of the subdomain: large subdomains have a small
‘perimeter to surface’ ratio $2(n_x + n_y)/(n_x n_y)$, leading to
a small value for $f_c$;

2. the machine characteristic $t_{comm}/t_{calc}$, indicating how
fast communication can be performed compared with floating point
operations;

3. the algorithm via the ratio $c_2/c_1$. The overhead $f_c$ will
be small for problems for which many floating point operations per
grid point must be performed ($c_1$ large), compared with the
amount of data to be communicated per grid point (represented by
$c_2$).
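
Formula (5) is easily evaluated for concrete numbers. The sketch below does so for a few subdomain sizes. All parameter values (a machine ratio of 10, $c_1 = 100$, $c_2 = 4$) are assumptions chosen purely for illustration, not measurements of any of the machines discussed.

```python
def comm_overhead(nx, ny, t_comm, t_calc, c1, c2):
    """Communication overhead f_c according to formula (5)."""
    return (c2 * 2 * (nx + ny) * t_comm) / (c1 * nx * ny * t_calc)


def efficiency(fc):
    """Parallel efficiency E = 1 / (1 + f_c) of the model problem."""
    return 1.0 / (1.0 + fc)


# Hypothetical values: t_comm / t_calc = 10, c1 = 100, c2 = 4.
for side in (8, 32, 128):
    fc = comm_overhead(side, side, t_comm=10.0, t_calc=1.0, c1=100.0, c2=4.0)
    print(side, round(fc, 4), round(efficiency(fc), 4))
```

Under these assumed values a 32 x 32 subdomain gives $f_c = 0.05$, i.e. an efficiency of about 95%. Quadrupling the side of the subdomain divides the overhead by four.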

Remark. An important characteristic of most communication systems
is the rather high message startup time. The cost of sending a
message between neighbouring processors can be written as

$$ T(n) = t_{startup} + n\, t_{send}, $$

where n indicates the length of the message (number of words
transferred), $t_{startup}$ is the message startup time (caused by
hardware and software delays) and $t_{send}$ is the marginal
communication time per word. For many systems $t_{startup}$ is much
larger than $t_{send}$ (even by a factor 1000). An immediate
conclusion is that sending many short messages should be avoided if
possible.

In (5) $t_{comm}$ denotes the average time to communicate one word.
This clearly depends very much on the average length of the
messages that are sent: for small messages
$t_{comm} \approx t_{startup}$, while for very large messages
$t_{comm} \approx t_{send}$. This must be taken into account when
analysing parallel algorithms by using (5).

A  further  analysis  of  this  model  problem  reveals  some 
important  guidelines  that  should  be  taken  into  ac¬ 
count  when  parallelising  CFD  algorithms. 

Different grid partitioning strategies. For this model problem, the
communication volume is proportional to the number of grid points
on the (interior) subdomain boundaries, i.e. proportional to the
‘perimeter’ of the subdomain. When the size of the subdomain is
fixed, the perimeter (and thus the communication volume) is minimal
if the number of grid points in each direction is equal, i.e.
$n_x = n_y$. We will use the term ‘square subdomain’ to denote the
latter case. Hence, partitioning into square subdomains leads to a
minimal communication volume.

This observation can be generalised as follows. A stripwise (or
one-dimensional) partitioning (Fig. 9, left) yields subdomains with
long boundaries but with at most two adjacent subdomains. A
blockwise (or two-dimensional) partitioning (Fig. 9, right) gives
subdomains with shorter boundaries but with up to four neighbours.
Thus a blockwise partitioning minimises the total communication
volume, while a stripwise partitioning minimises the number of
messages. Which choice is best depends on the characteristics of
the problem and of the parallel computer. When the message startup
time dominates the communication time per message, the stripwise
partitioning will be beneficial.
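
The trade-off between message count and message length can be made concrete with the message cost model of the Remark above. The sketch below compares an interior strip with an interior block of an n x n grid on P processors. It assumes that the messages to different neighbours are sent one after another, and the timing values in the usage lines are hypothetical.

```python
import math


def message_time(n_words, t_startup, t_send):
    """Cost of one message: startup time plus time per word."""
    return t_startup + n_words * t_send


def strip_comm_time(n, p, t_startup, t_send):
    """Interior strip of an n x n grid: 2 neighbours, n words each.
    The message length does not depend on p for a stripwise partitioning."""
    return 2 * message_time(n, t_startup, t_send)


def block_comm_time(n, p, t_startup, t_send):
    """Interior block of a sqrt(p) x sqrt(p) partitioning:
    4 neighbours, n / sqrt(p) words each."""
    return 4 * message_time(n / math.sqrt(p), t_startup, t_send)


# Hypothetical machine parameters (time units per word).
for t_startup in (1000.0, 10.0):
    print(t_startup,
          strip_comm_time(256, 16, t_startup, 1.0),
          block_comm_time(256, 16, t_startup, 1.0))
```

For a 256 x 256 grid on 16 processors, the stripwise partitioning is cheaper when the assumed startup time is 1000 times the per-word time, while the blockwise partitioning is cheaper when it is only 10 times larger.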

Note  that  the  communication  requirements  are  not 
always  ‘isotropic’  in  all  directions,  but  they  may  de¬ 
pend  on  characteristics  of  the  problem  or  the  numer¬ 
ical  algorithm.  This  may  influence  the  partitioning 
strategy.  Consider  for  example  the  solution  of  the 
compressible Navier-Stokes equations around an airfoil. The
inclusion of an algebraic turbulence model may lead to a global
dependence (and communication) in the direction perpendicular to
the airfoil.
Then  a  stripwise  partitioning  is  to  be  preferred. 

Dependence on the size of the subdomains. When each subdomain
contains $N = n_x \times n_y$ grid points and a blockwise
partitioning is used with square subregions
($n_x = n_y = \sqrt{N}$), Eq. (5) yields

$$ f_c = \frac{4\,c_2}{c_1}\,\frac{1}{\sqrt{N}}\,\frac{t_{comm}}{t_{calc}}. \tag{6} $$

This indicates that the communication overhead $f_c$ remains
constant, independent of the number of processors, as long as the
size of the subdomains remains constant! Of course this implies
that, to maintain a given parallel efficiency, the total problem
size M must grow when the number of processors P grows, since
$M = N \times P$. The relation $f_c \propto 1/\sqrt{N}$ also
indicates that, for fixed (total) problem size M, the efficiency
and speedup decrease when the number of processors increases (cf.
Amdahl’s law). This analysis is only valid when the only
communication is the exchange of information between neighbouring
processors. Any ‘global communication’ (e.g., the collection of the
local residuals to compute the global residual) implies an overhead
which grows with increasing number of processors. However, the
relative importance of such global communications is often very low
and does not really affect the overall speedup and efficiency.
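
The two scaling regimes can be contrasted numerically. The sketch below evaluates the overhead for square subdomains, $f_c = (4 c_2 / c_1)(t_{comm}/t_{calc})/\sqrt{N}$, once with a fixed subdomain size N and once with a fixed total size M, so that N = M/P. The default parameter values are illustrative assumptions only.

```python
import math


def overhead_square(n_points, ratio=10.0, c1=100.0, c2=4.0):
    """f_c for a square subdomain of n_points grid points.
    ratio stands for t_comm / t_calc; all defaults are hypothetical."""
    return (c2 / c1) * 4.0 / math.sqrt(n_points) * ratio


def efficiency_fixed_subdomain(n_points, p):
    """Subdomain size fixed: the result does not depend on p."""
    return 1.0 / (1.0 + overhead_square(n_points))


def efficiency_fixed_total(m_total, p):
    """Total problem size fixed: each processor gets m_total / p points."""
    return 1.0 / (1.0 + overhead_square(m_total / p))


for p in (16, 64, 256, 1024):
    print(p,
          round(efficiency_fixed_subdomain(1024, p), 4),
          round(efficiency_fixed_total(65536, p), 4))
```

With these assumed values the efficiency stays near 95% when every processor keeps 1024 points, but drops to about 83% on 1024 processors when the total of 65536 points is held fixed.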

Extension to larger stencils and to 3D grids.

The analysis presented above remains valid when other computational
stencils are used instead of a 5-point stencil [14]. It may be
necessary to use a larger overlap region (e.g., with a width of 2
points). In that case the communication volume (and the constant
$c_2$) increases, but the number of operations per cell (and thus
the constant $c_1$) also increases. Hence, the communication
overhead does not necessarily grow.

In case of three-dimensional grids, 1D-, 2D- and 3D-partitionings
can be used. The communication volume is then determined by the
‘surface to volume’ ratio of the subdomains, leading to a factor
$N^{-1/3}$ in Eq. (6), see e.g. [14].

Dependence on the machine characteristics. The communication
overhead of a given algorithm is proportional to the machine
characteristic $t_{comm}/t_{calc}$, so the speedup and parallel
efficiency depend on this ratio. The various parallel systems
available have quite different values for this characteristic. Thus
the communication overhead may vary substantially on different
machines.

Fig. 9: Strip- and blockwise partitioning of a grid.

Note that computer manufacturers may alternately upgrade the
processors and the communication network of their systems.
Upgrading the processors without also increasing the communication
speed may result in an ‘unbalanced’ system with a large ratio
$t_{comm}/t_{calc}$.

Dependence  on  the  problem  characteristics. 
Computational  Fluid  Dynamics  applications  are 
characterised  by  a  high  number  of  floating  point  op¬ 
erations  per  grid  point  or  cell  per  iteration,  while 
only  a  few  variables  are  associated  with  each  point. 
Thus the factor $c_2/c_1$ in the communication overhead will be
small.

As  a  result,  minimisation  of  the  communication  over¬ 
head  does  not  always  influence  the  speedup  and  par¬ 
allel  efficiency  very  much.  However,  it  is  always  im¬ 
portant  to  minimise  the  work  load  imbalance.  Below 
we  show  that  in  general  a  blockwise  partitioning  also 
minimises  the  load  imbalance.  Thus  in  many  cases 
minimisation  of  communication  overhead  and  min¬ 
imisation  of  the  load  imbalance  go  hand  in  hand. 

3.3  Analysis  of  the  load  imbalance 

Let $T_i^{calc}$, $i = 1, \ldots, P$, denote the time spent by the
i-th processor in calculation, and let $T^{calc}_{average}$ and
$T^{calc}_{max}$ denote respectively the average and the maximum
calculation time for the P processors. The load balance factor is
defined as

$$ \lambda(n,P) = \frac{T^{calc}_{average}}{T^{calc}_{max}}. \tag{7} $$

The load balance factor is a good estimate for the parallel
efficiency, if the number of operations to be performed (counted
sequentially) does not depend on the number of processors, and if
the communication time can be neglected. Indeed, in this case the
parallel efficiency is given by

$$ E(n,P) = \frac{T(n,1)}{P\,T(n,P)} \approx \frac{\sum_{i=1}^{P} T_i^{calc}}{P \max_i T_i^{calc}} = \frac{T^{calc}_{average}}{T^{calc}_{max}}, $$

and thus $E(n,P) \approx \lambda(n,P)$.

Note  that  a  commonly  made  mistake  is  to  measure 
load  (im)balance  by  comparing  the  maximum  and 
the  minimum  calculation  times. 

In  many  applications,  the  processors  are  (implicitly) 
synchronised  by  the  communication  needed  to  up¬ 
date  the  ‘overlap  regions’  at  the  end  of  each  step  of  an 
iterative  procedure.  In  that  case,  we  can  determine 
the  efficiency  and  speedup  by  analysing  one  iteration 
step.  Assume  now  that  the  amount  of  work  per  grid 
point  is  constant,  and  that  the  communication  time 
can  be  neglected.  We  then  obtain 

$$ E(n,P) = \frac{T(n,1)}{P\,T(n,P)} = \frac{M}{P\,N_{max}} = \frac{N_{average}}{N_{max}}, $$

where M is the total number of grid points, $N_{max}$ is the
maximum number of grid points in a subdomain and
$N_{average} = M/P$.

Assume  that  a  rectangular  grid  is  distributed  among 
P  processors.  If  the  grid  cannot  be  equally  dis¬ 
tributed  among  the  processors,  then  a  blockwise  par¬ 
titioning  leads  to  a  higher  load  balance  factor  than 
a  stripwise  partitioning.  A  partitioning  into  square 
subdomains  will  lead  to  a  maximal  load  balance  fac¬ 
tor,  i.e.  a  minimal  load  imbalance. 
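
A small calculation illustrates both the load balance factor (7) and the comparison of the two partitionings. The sketch assumes an n x n grid, constant work per grid point, a number of processors P that is a perfect square, and a division of each direction into pieces that are as even as possible. The helper names are our own.

```python
import math


def load_balance(counts):
    """Load balance factor (7) for work proportional to the point count:
    average number of points divided by the maximum number of points."""
    return (sum(counts) / len(counts)) / max(counts)


def split(n, parts):
    """Sizes of `parts` contiguous pieces of n points, as even as possible."""
    q, r = divmod(n, parts)
    return [q + 1] * r + [q] * (parts - r)


def strip_counts(n, p):
    """Points per processor for a stripwise partitioning of an n x n grid."""
    return [rows * n for rows in split(n, p)]


def block_counts(n, p):
    """Points per processor for a sqrt(p) x sqrt(p) blockwise partitioning."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square in this sketch")
    pieces = split(n, side)
    return [a * b for a in pieces for b in pieces]


print(load_balance(strip_counts(50, 16)))   # strips of 4 or 3 rows
print(load_balance(block_counts(50, 16)))   # blocks of 13 or 12 points per side
```

For a 50 x 50 grid on 16 processors, the strips hold 200 or 150 points and the blocks at most 169 points, so the blockwise partitioning gives the higher load balance factor. Comparing minimum and maximum instead (150/200 for the strips) would give a different number, which is the mistake warned against above.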

The  treatment  of  boundary  conditions  is  also  a  po¬ 
tential  source  of  load  imbalance.  Indeed,  in  general 
the  computational  work  to  be  done  for  boundary  cells 
differs  from  the  work  for  interior  cells.  In  order  to 
minimise  the  load  imbalance  caused  by  the  treat¬ 
ment  of  the  boundary  conditions,  the  boundary  cells 
should be distributed as equally as possible among the
processors.  This  is  achieved  when  the  subdomains 
are  (nearly)  square. 

The  assumption  that  the  amount  of  work  per  grid 
point  or  cell  is  constant  is  not  always  valid  in  CFD. 
For example, the computational effort may differ for
cells  lying  in  a  subsonic  region  and  in  a  supersonic 
region.  This  can  cause  load  imbalance,  which  cannot 
be  accurately  predicted  beforehand.  Similar  prob¬ 
lems  arise  when  the  mathematical  model  differs  in 
various  parts  of  the  domain,  e.g.  when  chemical  re¬ 
actions  are  taken  into  account  in  high-temperature 
zones. 

3.4 Numerical efficiency of parallel algorithms

Until  now,  we  have  discussed  how  the  parallel  over¬ 
head  affects  the  performance,  by  comparing  the  par¬ 
allel  execution  time  with  the  time  needed  by  the  same 
algorithm  on  one  processor. 

In  many  cases  the  algorithm  used  on  the  parallel  ma¬ 
chine  is  different  from  the  one  typically  used  on  a 
sequential  machine.  In  order  to  obtain  acceptable 
parallel  efficiencies,  sequential  algorithms  are  often 
modified  to  decrease  the  communication  volume  or 
the number of synchronisation points. This may deteriorate the
numerical efficiency of the algorithm. It may even be necessary to
use a rather different algorithm (with different numerical
properties: number of operations, convergence properties, etc.) on
a parallel computer, if the sequential algorithm cannot be
parallelised easily and efficiently.

Explicit  methods.  Explicit  methods  are  inher¬ 
ently  parallel  and  the  numerical  properties  are  not 
affected  by  parallelisation  (grid  partitioning),  when 
all  necessary  communication  is  performed.  For  ex¬ 
ample,  communication  is  needed  after  each  substep 
of  a  Runge-Kutta  method.  One  can  reduce  the  com¬ 
munication  overhead  by  updating  the  overlap  regions 
only  after  a  complete  time-integration  step.  Omit¬ 
ting  some  communication  can  result  in  slightly  worse 
convergence  properties,  but  can  lead  to  a  higher  ‘to¬ 
tal  speedup’.  The  effect  is  very  problem  depen¬ 
dent.  Note  that  this  technique  results  in  a  ‘block- 
structured’  approach,  but  here  the  number  of  blocks 
is  determined  by  the  number  of  processors,  not  by 
the  geometry  of  the  domain. 

Implicit  methods.  The  situation  is  more  complex 
when  implicit  methods  are  used. 

• Assume that the resulting linear systems are solved by a point
relaxation scheme. Jacobi relaxation is inherently parallel. In
this case the communication requirements are exactly the same as in
the model problem described above (exchange of the overlap
regions). Gauss-Seidel relaxation usually has better convergence
properties. On a sequential computer, a Gauss-Seidel iteration
typically sweeps through the grid cells in lexicographic order. On
a vector or parallel computer, a Red-Black ordering of the grid
points is necessary. All ‘red’ points can be updated in parallel,
and afterwards the ‘black’ points can be updated. The convergence
rate of lexicographic and Red-Black Gauss-Seidel can differ
substantially. This will be illustrated in Section 4.


•  When  line  relaxation  schemes  are  used,  (block) 
tridiagonal  systems  must  be  solved.  This  leads 
to  data  dependencies  between  the  grid  points 
lying  on  the  same  gridline.  If  one  only  sweeps 
in  one  direction,  the  tridiagonal  systems  —  and 
the  associated  data  dependencies  —  only  occur 
along  that  direction.  By  using  a  stripwise  par¬ 
titioning,  one  can  ensure  that  each  tridiagonal 
system  belongs  to  only  one  processor.  Then  each 
system  can  be  solved  by  the  Thomas  algorithm 
(i.e.  Gaussian  elimination),  which  is  the  optimal 
sequential  solver. 

The parallelisation of line relaxation is not so easy if a
blockwise partitioning is used, or if one performs line relaxation
in different directions. Then (part of) the tridiagonal systems are
distributed among processing nodes. Parallel solvers for (block)
tridiagonal systems have been developed, based on substructured
Gaussian elimination and/or on cyclic reduction [26] [27]. However,
the operation count of these solvers is about 2 times higher than
for the Thomas algorithm (hence their numerical efficiency is low)
and they contain a sequential part. Since many tridiagonal systems
must be solved, the latter drawback can be avoided by distributing
the sequential parts equally over the processors (at the expense of
some communication). Attempts are made to reduce the computational
cost of the parallel algorithms by using approximate solvers
[28] [29].

An  alternative  is  to  solve  the  set  of  tridiago¬ 
nal  systems  by  using  the  Thomas  algorithm  in 
a  pipelined  fashion.  This  strategy  however  re¬ 
quires  the  communication  of  many  short  mes¬ 
sages  and  leads  to  some  load  imbalance  (during 
the  start-up  and  the  end  phase  of  the  pipeline). 
Another  alternative  strategy  to  solve  tridiago¬ 
nal  systems  oriented  in  two  directions  goes  as 
follows.  We  know  that  when  a  stripwise  parti¬ 
tioning  of  the  data  is  used,  the  tridiagonal  sys¬ 
tems  oriented  in  one  of  the  two  directions  can  be 
solved  by  the  Thomas  algorithm.  The  Thomas 
algorithm  can  be  used  to  solve  the  tridiagonal 
systems  in  both  directions,  if  in  both  phases  of 
the  algorithm  a  different  stripwise  partitioning 
is  used,  such  that  in  each  phase  a  tridiagonal 
system  is  stored  in  only  one  processor.  This 
requires  that  a  complete  ‘data  transposition’  is 
carried out between both phases. The communication volume of the
data transposition is proportional to the number of grid points
per processor. Since the same holds for the calculation
cost,  the  parallel  efficiency  may  still  be  accept¬ 
able.  The  latter  strategy  is  the  most  efficient 
one  (in  terms  of  total  efficiency)  to  implement 
the  semi-implicit  ADI  time  integration  scheme 
on  finite  difference  grids  with  irregular  bound¬ 
aries  [30]. 

•  Another  example  of  the  interaction  between  nu¬ 
merical  and  parallel  aspects  can  be  found  in 
multigrid.  W-cycles  are  usually  more  efficient 
than  V-cycles  in  terms  of  work-units  needed  to 
achieve  convergence.  On  a  parallel  computer 
however, W-cycles result in poor parallel efficiency and one
therefore frequently resorts to V-cycles despite their inferior
numerical properties [31].

•  Parallel algorithms for solving partial differential
equations can also be based on domain decomposition
in the mathematical sense. Two approaches
are possible.

In the Schwarz domain decomposition approach,
overlapping subdomains are used. The
differential  equations  are  solved  on  each  subdo¬ 
main  separately,  using  an  approximation  for  the 
solution  at  the  subdomain  boundaries.  The  re¬ 
sulting  approximate  solutions  provide  a  new  ap¬ 
proximation  for  the  solution  on  the  boundaries 
of  the  (overlapping)  neighbouring  subdomains. 
This  process  must  be  repeated  in  an  iterative 
way. 

In  the  Schur  Complement  approach,  non¬ 
overlapping  subdomains  are  used.  The  subdo¬ 
main  problems  are  solved  in  terms  of  the  vari¬ 
ables  on  the  borders  of  the  subdomains.  After 
computation  of  the  variables  on  these  borders 
(interface  or  ‘Schur  complement’  problem),  the 
variables  on  the  subdomains  can  be  determined. 
Note  that  both  domain  decomposition  ap¬ 
proaches  often  require  extra  calculations  com¬ 
pared  to  when  no  decomposition  is  used.  These 
extra  calculations  must  be  considered  as  al¬ 
gorithmic  overhead  caused  by  the  parallelisa¬ 
tion.  Domain  decomposition  techniques  for 
CFD  problems  are  described  in  e.g.  [32,  33,  34, 
35]. 
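
As a concrete illustration of the Schwarz approach, the following Python sketch applies an alternating Schwarz iteration to the 1D model problem -u'' = 1 on (0,1) with u(0) = u(1) = 0. This model problem is not taken from the text. Each subdomain problem is tridiagonal and is solved with the Thomas algorithm mentioned earlier. The grid size, the overlap and all function names are assumptions made for this example only.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c and right-hand side d (a[0] and c[-1] are unused)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def solve_subdomain(u, lo, hi, h):
    """Solve -u'' = 1 on the points lo..hi (inclusive), using the current
    values u[lo-1] and u[hi+1] as Dirichlet boundary data."""
    n = hi - lo + 1
    rhs = [h * h] * n
    rhs[0] += u[lo - 1]
    rhs[-1] += u[hi + 1]
    u[lo:hi + 1] = thomas([-1.0] * n, [2.0] * n, [-1.0] * n, rhs)


def schwarz(n=31, overlap=4, tol=1e-10, max_iter=1000):
    """Alternating Schwarz on two overlapping subdomains of a grid with
    n interior points; returns the solution and the iteration count."""
    h = 1.0 / (n + 1)
    u = [0.0] * (n + 2)                        # includes both end points
    mid = (n + 1) // 2
    for it in range(1, max_iter + 1):
        old = u[:]
        solve_subdomain(u, 1, mid + overlap, h)    # left subdomain
        solve_subdomain(u, mid - overlap, n, h)    # right subdomain
        if max(abs(p - q) for p, q in zip(u, old)) < tol:
            return u, it
    return u, max_iter


u, iterations = schwarz()
# The discrete solution of this model problem is x (1 - x) / 2 at the grid points.
exact = [i / 32 * (1 - i / 32) / 2 for i in range(33)]
print(iterations, max(abs(p - q) for p, q in zip(u, exact)))
```

The repeated subdomain solves are the algorithmic overhead referred to above: a single Thomas solve on the undivided grid would give the same answer without iteration.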


4  EXAMPLES 

In  this  section  we  illustrate  some  of  the  concepts  in¬ 
troduced  in  the  previous  sections.  We  first  discuss 
the  parallel  performance  of  an  explicit  Euler  Solver 
on  Intel  iPSC/2  and  iPSC/860  distributed  memory 
computers.  We  show  that  ‘double  flux  calculations’, 
caused  by  the  grid  partitioning,  may  form  a  substan¬ 
tial  algorithmic  overhead  in  the  parallel  code.  We 
comment  on  various  approaches  to  measure  the  par¬ 
allel  performance  and  we  introduce  the  effectivity  as 
an  alternative  performance  measure. 

We  then  present  results  of  experiments  on  the  par¬ 
allelisation  of  implicit  Euler  solvers.  We  discuss  the 
achieved  parallel  speedup  and  parallel  efficiency,  but 
we  also  show  how  the  numerical  efficiency  of  parallel 
algorithms  may  influence  the  total  speedup  and  total 
efficiency,  which  is  a  better  measure  for  the  actual 
performance. 

Finally,  we  describe  results  obtained  with  a  block 
structured  Euler  solver,  in  which  an  adaptive  block 
refinement  procedure  leads  to  the  creation  of  new 
blocks. We show that the use of mapping heuristics
makes it possible to map the block structure onto the processors
of the parallel machine such that the load imbalance
and the communication cost are low.


4.1  Parallel performance of an explicit Euler Solver

We first describe some experiments with a parallel
multi-block explicit Euler solver [36, 37, 38]. We have
used the following schemes:

Scheme 1) a first order upwind discretisation with
Van Leer flux vector splitting, combined with
a forward Euler time integration with local
timestepping;

Scheme 2) a second order Roe scheme, with a minmod
limiter on characteristic variables, combined
with a five stage Runge-Kutta time integrator.

In  the  parallel  version  of  the  Runge-Kutta  scheme, 
communication  occurs  only  once  per  time  step,  i.e. 
before  the  first  stage.  Scheme  2  has  a  much  higher 
ratio  of  calculation  time  to  communication  than 
scheme  1. 

We have used the following testcase: transonic flow
around the NACA0012 airfoil, with boundary conditions
M = 0.80 (Mach number), angle of attack
of 1.25°, T0 = 278 K (total temperature) and
p0 = 150000 Pa (total pressure). The structured C-grid
(240 x 19 cells) shown in Fig. 11 can be split in
2, 4, ..., 16 blocks of equal size (1D partitioning,
orthogonal to the airfoil), see Fig. 11a. Thus all these
partitionings allow a nearly perfect calculation load
balance.

Two sets of tests were done: the first set with N = P
blocks and the second set with N = 16 blocks regardless
of the number of processors, P. The first case
corresponds to a situation where the grid is partitioned
for parallel processing purposes only. Extra
calculations caused by the partitioning must be considered
as algorithmic overhead, as discussed below.
The second case corresponds to a situation where the
partitioning into blocks results from physical considerations.
A sequential code would use the same partitioning
into blocks.

4.1.1  Algorithmic overhead: double flux computations

We  first  discuss  a  typical  ‘algorithmic  overhead’  that 
occurs  in  parallel  CFD  codes.  At  subdomain  bound¬ 
aries,  the  parallel  code  cannot  exploit  the  symmetry 
properties  of  the  numerical  fluxes.  The  fluxes  are  cal¬ 
culated  twice  on  every  edge  of  a  subdomain  bound¬ 
ary,  once  in  every  block.  If  the  grid  is  partitioned  for 
parallel  processing  purposes  only,  these  ‘double  flux 
computations’  form  an  overhead,  not  present  in  the 
sequential  code.  Experiments  indicate  that  the  time 
required  for  those  extra  flux  computations  is  of  the 
same  order  of  magnitude  as  or  even  larger  than  the 
communication time, see [37] and Tables 1 and 2.
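
To get a feeling for the size of this overhead, the sketch below counts how many face fluxes are evaluated twice when a structured grid is cut into strips. It is a rough face-count model under stated assumptions, not the measurement procedure behind the tables: the grid of `ni` x `nj` cells is cut into `nblocks` strips along the i-direction, there are `nblocks - 1` internal interfaces, and every face costs the same. The function name is hypothetical.

```python
def duplicated_flux_fraction(ni, nj, nblocks):
    """Fraction of extra face-flux evaluations for an ni x nj cell grid
    cut into nblocks strips along the i-direction (simple counting model)."""
    faces = (ni + 1) * nj + ni * (nj + 1)      # all cell faces of the grid
    interfaces = nblocks - 1                   # internal block boundaries
    return interfaces * nj / faces             # each interface face is computed twice


for p in (2, 4, 8, 16):
    print(p, round(100 * duplicated_flux_fraction(240, 19, p), 2))
```

Because the model only counts faces, it ignores the other sources of overhead introduced by splitting the grid and is not expected to reproduce the measured values.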

Suppose that the number of blocks N is equal to the
number of processors P. The time lost in the extra
flux computations and some additional overhead introduced
by splitting the grid in subdomains is given
by $t^{\mathrm{calc}}(N = P, P) - [\ldots]$, where $t^{\mathrm{calc}}(N, P)$
denotes the total calculation time on P processors
for a grid partitioned into N blocks. (Note that
$t^{\mathrm{calc}}(N, P)$ is equal to the sum of the sequential calculation
times for all blocks; communication time and
processor idle time are not taken into account.)


Number of processors    scheme 1    scheme 2
          2                0.7         4.8
          4                1.6         9.0
          8                2.4        12.6
         16                2.7        14.6

Table 1: Double flux computation overhead,
defined as $t^{\mathrm{calc}}(N = P, P) - [\ldots]$,
for the two explicit schemes on iPSC/2.


Number of processors    scheme 1    scheme 2
          2                4.7        36.9
          4                3.8        33.3

Table 2: Double flux computation overhead,
defined as $t^{\mathrm{calc}}(N = P, P) - [\ldots]$,
for the two explicit schemes on iPSC/860.


      |               N = P                 |               N = 16
  P   | α(P,P) (%)   E(P,P) (%)   S(P,P)    | α(16,P) (%)   E(16,P) (%)   S(16,P)
  1   |    99.9        100         1        |    98.7         100          1
  2   |    99.7         99.8       1.995    |    98.4          99.8        1.996
  4   |    99.1         98.5       3.94     |    98.1          99.5        3.98
  8   |    97.9         96.3       7.70     |    97.1          98.4        7.87
 16   |    95.9         90.2      14.43     |    95.9          97.1       15.54

Table 3: Effectivity α, efficiency E and speedup S on the iPSC/2 for
first order Van Leer, Euler time integrator.

The  double  flux  computation  overhead  on  the 
iPSC/2  for  the  two  schemes  mentioned  above  is  given 
in  Table  1.  The  results  show  that  the  time  spent  in 
the  extra  flux  calculations  in  scheme  1  (first  order 
Van Leer, forward Euler) is of the same order of magnitude
as the communication time, while for scheme
2 (second order Roe, Runge-Kutta) the double flux
computation overhead is much larger than the communication
overhead. For a Navier-Stokes computation,
the double flux computation overhead would become
even more dominant. Table 2 shows the results
obtained  on  an  iPSC/860  system,  for  which  both  cal¬ 
culation  and  communication  are  much  faster  than 
on  the  iPSC/2.  For  this  example,  the  double  flux 
computation  overhead  is  even  larger  than  on  iPSC/2 
systems.  Note  however  that  the  code  has  not  been 
optimised  for  the  cache  memory  on  the  processors  of 
the  iPSC/860.  In  an  optimised  code,  the  double  flux 
calculation  overhead  would  be  smaller. 

This  experiment  shows  that  it  does  not  always  make 
sense  to  try  to  minimise  the  communication  time.  In 
some  cases,  it  would  be  better  to  eliminate  the  double 
flux  computations  via  additional  communication  of 
the  fluxes. 


4.1.2  Parallel  performance  measurements 

We  have  measured  the  parallel  performance  on  the 
Intel  iPSC/2  of  the  explicit  Euler  solver  using  scheme 
1  (first  order,  forward  Euler),  because  this  scheme 
has  a  low  calculation  to  communication  ratio  as  com¬ 
pared  to  other  methods,  so  the  parallel  performance 
of  this  scheme  reflects  a  worst-case  situation. 

The parallel efficiency E(N,P) and the speedup
S(N,P) compare the execution times on one and on
P processors. Another measure for the parallel performance
of an algorithm can be defined as follows.

The effectivity α of a parallel algorithm is defined as
the amount of time spent in the actual calculation
relative to the total execution time; for a multi-block
code this can be computed as

$$\alpha(N, P) = \frac{\sum_{i=1}^{N} t_i^{\mathrm{calc}}}{P \cdot T(N, P)} \qquad (8)$$

where $t_i^{\mathrm{calc}}$ denotes the calculation time for
block i and where T(N,P) denotes the execution
time of a parallel iteration step for an N-block grid
on P processors (incl. communication). The effectivity
takes three factors into account: the load imbalance,
the communication overhead and the scheduling
overhead. For compute-intensive problems, one
can expect that α(N,P) is approximately equal to
the load balance factor λ(N,P). Note that one need
not be able to run the program on a single processor
to determine α.
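
The definition in Eq. (8) translates directly into a few lines of Python. The sketch below is only an illustration: the function name and the timings are hypothetical and do not come from the experiments reported here.

```python
def effectivity(block_calc_times, step_time, nprocs):
    """Effectivity as in Eq. (8): summed block calculation times
    divided by P times the parallel step time T(N, P)."""
    return sum(block_calc_times) / (nprocs * step_time)


# Hypothetical timings in seconds: four blocks mapped onto two processors.
calc = [1.0, 1.0, 1.0, 1.2]
T = 2.3   # time of one parallel iteration step, including communication
alpha = effectivity(calc, T, nprocs=2)
print(round(100 * alpha, 1))
```

Only quantities measured during the parallel run enter the formula, which is why no single-processor run is needed.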

In Table 3 we present the parallel efficiency, speedup
and effectivity obtained for the two sets of tests mentioned
above: a) N = P blocks; b) N = 16 blocks
regardless of the number of processors. Only in case
a) are the extra flux calculations considered to be a
loss. Therefore, the parallel efficiency and speedup are
higher in case b). However, lower effectivities are reported
in case b), as more blocks have to be managed and more
interblock communication occurs.

This comparison stresses the importance of clearly
stating how efficiencies and speedups are measured.
It also demonstrates that for this type of application,
a high parallel efficiency and speedup can be
obtained when load imbalance is insignificant.

4.2  Parallel  implicit  Euler  solvers 

4.2.1  Influence  of  the  partitioning  strategy 

In  this  section  we  report  on  some  experiments  with 
parallel  implicit  Euler  solvers.  A  first  series  of  tests 
has  been  done  with  a  solver,  based  on  a  first  order 
discretisation  with  Van  Leer  flux  vector  splitting  and 
backward  Euler  time  integration,  see  e.g.  [39].  As 
test  case,  we  have  used  the  GAMM  channel  with 
circular  bump  and  inlet  Mach  number  M  =  0.85, 
discretised  on  a  structured  grid  with  96  x  32  inte¬ 
rior  finite  volume  cells.  Treatment  of  the  bound¬ 
ary  conditions  leads,  for  each  cell  on  the  bound¬ 
ary,  to  a  vector  of  ‘boundary  unknowns’,  which  can 
be  associated  with  a  grid  point  on  the  boundary. 
Thus the computational domain consists of a grid
of $m_x \times m_y = 98 \times 34$ ‘grid points’, to be distributed
among up to 16 processors.

We have considered several partitionings of the computational
domain, leading to different subdomain
configurations $N_x \times N_y$, where $N_x$ and $N_y$ denote
the number of subdomains in the x- and
y-direction respectively. In all cases, the interior ‘grid points’ are
equally distributed among the processors, but some
load imbalance occurs, due to the unequal distribution
of the boundary points.

In  the  previous  section  we  have  indicated  that  the 
load  balance  factor  (7)  gives  a  good  prediction  of 
the  parallel  efficiency  and  the  parallel  speedup,  when 
the  communication  overhead  is  small  and  when  all 
grid  points  require  approximately  the  same  amount 
of  work.  Table  4  presents  these  predicted  efficiencies 
and  speedups,  and  also  shows  the  parallel  efficiencies 
and  speedups  that  are  obtained  when  the  linear  sys¬ 
tems  are  solved  with  a  Red-Black  Gauss-Seidel  relax¬ 
ation  scheme  on  an  Intel  iPSC/2  parallel  computer. 
(Similar performance will be obtained on other parallel
computers with a similar machine characteristic
$t_{\mathrm{comm}}/t_{\mathrm{calc}}$.)

The results differ from the predicted values for two
reasons: (1) the actual load imbalance is smaller
than predicted because the boundary points (causing
load imbalance) require fewer operations than the
other points; (2) the parallel overhead is higher
due to the communication overhead. Note that (1)
and (2) have opposite effects on the parallel efficiency
and speedup. Since the ratio $m_x/m_y = 98/34$ is
approximately equal to 3, subdomain configurations
with $N_x/N_y \approx 3$ yield nearly square subdomains.


The results in Table 4 clearly show that, for a fixed
number of subdomains, the load balance factor and
the achieved parallel efficiency are maximal for nearly
square subdomains.
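
The predicted efficiencies in Table 4 can be recomputed from the grid dimensions alone. The following Python sketch does so under one assumption about how the points were distributed: the 96 x 32 interior points are split evenly, and each boundary line of ‘grid points’ is given to the adjacent outermost subdomain. The function names are hypothetical, and the subdomain counts must divide 96 and 32.

```python
def max_points(nx, ny, mx=98, my=34):
    """Largest number of 'grid points' in one subdomain of an nx x ny
    partitioning; boundary lines go to the outermost subdomains."""
    cols = (mx - 2) // nx + (2 if nx == 1 else 1)
    rows = (my - 2) // ny + (2 if ny == 1 else 1)
    return cols * rows


def load_balance(nx, ny, mx=98, my=34):
    """Load balance factor: average over maximum number of points."""
    average = mx * my / (nx * ny)
    return average / max_points(nx, ny, mx, my)


for nx, ny in [(16, 1), (8, 2), (4, 4), (2, 8), (1, 16)]:
    print(f"{nx:2d} x {ny:2d}: N_max = {max_points(nx, ny)}, "
          f"predicted efficiency = {100 * load_balance(nx, ny):.1f} %")
```

With this distribution the maximum point counts agree with the N_max column of Table 4, for example 238 for the 16 x 1 configuration and 221 for 8 x 2.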

4.2.2  Total  efficiency  and  speedup 

However,  to  measure  the  actual  performance  of  a 
parallel  solver,  one  should  rather  consider  the  to¬ 
tal  speedup  —  see  Eq.  2  —  instead  of  the  parallel 
speedup,  by  taking  into  account  the  numerical  qual¬ 
ity  of  the  parallel  algorithms.  We  have  therefore  com¬ 
pared  the  convergence  properties  of  Red-Black  and 
lexicographic  Gauss-Seidel  relaxation  schemes.  For 
the  lexicographic  Gauss-Seidel  scheme,  two  sweep 
directions were used alternately. For this test
problem,  the  number  of  relaxation  steps  required  to 
achieve  convergence  for  lexicographic  and  Red-Black 
Gauss-Seidel  were  492  and  1090  respectively.  Thus 
the  (sequential)  execution  time  with  the  Red-Black 
Gauss-Seidel  scheme  is  more  than  2  times  higher  than 
with  lexicographic  Gauss-Seidel.  As  a  result,  the  to¬ 
tal  efficiency  (taking  into  account  the  numerical  effi¬ 
ciency)  of  the  parallel  Red-Black  relaxation  scheme 
is  less  than  50  %,  even  when  the  parallel  efficiency  is 
nearly  100  %  ! 
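
The bound quoted above follows from a short calculation, sketched here under two assumptions that are not spelled out in the text: one relaxation step costs about the same for both orderings, and the total efficiency is the product of a numerical efficiency and the parallel efficiency. The symbols $E_{\mathrm{num}}$, $E_{\mathrm{par}}$ and $E_{\mathrm{tot}}$ are introduced for this illustration only.

```latex
E_{\mathrm{num}} = \frac{492}{1090} \approx 0.45,
\qquad
E_{\mathrm{tot}} = E_{\mathrm{num}} \cdot E_{\mathrm{par}} \le 0.45
\quad \text{for } E_{\mathrm{par}} \le 1 .
```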

An  alternative  is  to  use  a  multi-block  approach:  each 
subdomain  is  treated  independently  (i.e.  in  parallel) 
and  in  each  subdomain  a  lexicographic  Gauss-Seidel 
relaxation  is  performed.  Information  on  subdomain 
boundaries  is  exchanged  after  each  complete  relax¬ 
ation  step  (i.e.  after  an  upward  and  a  downward 
sweep  through  the  cells). 

Because  the  blocks  (subdomains)  themselves  are 
treated  in  a  Jacobi  fashion,  we  expect  convergence 
degradation  when  the  number  of  blocks  grows.  This 
is  indeed  the  case,  as  reported  in  Table  5.  The  re¬ 
quired  number  of  relaxation  steps  depends  on  (a)  the 
number  of  subdomains,  i.e.  the  number  of  proces¬ 
sors  and  (b)  the  aspect  ratio  of  the  subdomains.  Of¬ 
ten,  the  configuration  with  nearly  square  subdomains 
yields  the  fastest  convergence. 
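
The effect of treating the blocks in a Jacobi fashion can be reproduced on a much simpler problem. The sketch below is not the Euler solver discussed here: it relaxes the 1D model problem -u'' = 1 with zero boundary values, uses one upward and one downward Gauss-Seidel sweep inside each block, and freezes the neighbouring block values during a step. The function name, the grid size and the stopping test (distance to the known discrete solution) are assumptions chosen for the example.

```python
def steps_to_converge(nblocks, n=32, tol=1e-6, max_steps=100000):
    """Count relaxation steps for n interior points cut into nblocks
    equal blocks (n must be divisible by nblocks)."""
    h = 1.0 / (n + 1)
    exact = [i * h * (1.0 - i * h) / 2.0 for i in range(n + 2)]
    u = [0.0] * (n + 2)
    size = n // nblocks
    for step in range(1, max_steps + 1):
        old = u[:]                        # values exchanged before the step
        for b in range(nblocks):
            lo, hi = 1 + b * size, (b + 1) * size
            points = list(range(lo, hi + 1))
            for i in points + points[::-1]:   # upward, then downward sweep
                # points of the own block use the latest value,
                # points of neighbouring blocks the value from the last exchange
                left = u[i - 1] if i - 1 >= lo else old[i - 1]
                right = u[i + 1] if i + 1 <= hi else old[i + 1]
                u[i] = 0.5 * (h * h + left + right)
        if max(abs(p - q) for p, q in zip(u, exact)) < tol:
            return step
    return max_steps


for nblocks in (1, 2, 4, 8, 32):
    print(nblocks, steps_to_converge(nblocks))
```

On this model problem the step count does not decrease when the grid is cut into more blocks, and with one point per block the scheme reduces to point Jacobi.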

The total speedup of this multi-block solver can now
be defined as the ratio of the execution time on one
processor to the execution time on P processors to
reach the prescribed convergence criterion. The achieved total
speedup and total efficiency for some subdomain configurations
are given in Table 6. Clearly, for this
test problem and when the number of subdomains is
not too high, the ‘Block Jacobi, lexicographic Gauss-Seidel’
scheme is to be preferred over the (single
block) Red-Black Gauss-Seidel scheme, because the
total efficiency of the latter scheme is less than 50 %.

The  convergence  degradation  of  multi-block  implicit 
methods  due  to  the  increase  of  the  number  of  blocks 
is  problem  dependent,  but  in  most  cases  the  number 
of  iterations  increases  only  slightly.  In  [40]  a  study 
of  the  performance  degradation  for  several  implicit 
schemes  is  presented.  The  transonic  flow  computa¬ 
tion  over  the  NACA0012  airfoil  (see  section  4.1,  but 
with  0°  angle  of  attack)  has  been  used  as  a  testcase. 
In all implicit solvers considered in this study, the Jacobian
matrices of the residual are evaluated with a
first order upwind discretisation (Van Leer, Approximate
Steger-Warming or Yoon-Jameson), while the
residual driven to zero is either first order or second
order. Both line Gauss-Seidel and ADI methods are
used to solve the linearised system in each time step.
The second order residuals are based on MUSCL extrapolation
(third order upwind biased) and a generalised
minmod limiter with a compression factor of
2.


subdomain          grid points        parallel efficiency (%)     parallel speedup
configuration
N_x x N_y      N_average    N_max     predicted    achieved     predicted    achieved
 16 x 1         208.25       238        87.5         90.5         14.0         14.5
  8 x 2         208.25       221        94.2         92.0         15.1         14.7
  4 x 4         208.25       225        92.6         89.1         14.8         14.3
  2 x 8         208.25       245        85.0         83.2         13.6         13.3
  1 x 16        208.25       294        70.8         73.4         11.3         11.7
  8 x 1         416.5        442        94.2         95.7          7.54         7.66
  4 x 2         416.5        425        98.0         97.4          7.84         7.79
  2 x 4         416.5        441        94.4         93.4          7.55         7.47
  1 x 8         416.5        490        85.0         86.4          6.80         6.92
  4 x 1         833          850        98.0         98.9          3.92         3.96
  2 x 2         833          833       100           99.3          4.00         3.97
  1 x 4         833          882        94.4         94.7          3.78         3.79
  2 x 1        1666         1666       100           99.6          2.00         1.99
  1 x 2        1666         1666       100           98.9          2.00         1.98
  1 x 1        3332         3332       100          100            1.00         1.00

Table 4: Effect of the partitioning strategy on the load balance (= predicted
parallel efficiency) and the achieved parallel efficiency
of an implicit Euler solver.


subdomain          Block Jacobi                   Block Jacobi
configuration      lexicographic Gauss-Seidel     line Gauss-Seidel
N_x x N_y
 16 x 1                  623                          349
  8 x 2                  570                          302
  4 x 4                  574                          326
  2 x 8                  627                          394
  1 x 16                 921                          620
  8 x 1                  527                          265
  4 x 2                  503                          278
  2 x 4                  546                          312
  1 x 8                  646                          386
  4 x 1                  489                          234
  2 x 2                  467                          262
  1 x 4                  540                          304
  2 x 1                  438                          234
  1 x 2                  457                          250
  1 x 1                  430                          230

Table 5: Implicit Euler solver based on a multi-block approach: number
of iterations as function of the subdomain configuration.


subdomain        number      total      total
configuration    of steps    speedup    efficiency (%)
  8 x 2            570        11.3         70.4
  4 x 4            574        10.9         67.9
  4 x 2            503         6.66        83.2
  4 x 1            489         3.48        86.9
  2 x 2            467         3.65        91.2
  2 x 1            438         1.96        98.0
  1 x 1            430         1.00       100

Table 6: Total speedup and efficiency of the multi-block implicit Euler
Solver (Block Jacobi, lexicographic Gauss-Seidel).

The  following  implicit  methods  have  been  investi¬ 
gated,  where  the  first  item  refers  to  the  implicit  solver 
and  the  second  item  to  the  residual  driven  to  zero: 

•  Van Leer/Van Leer - Line Gauss-Seidel
(VL/VL-LGS)

•  Van Leer/Van Leer - ADI (VL/VL-ADI)

•  Approximate Steger-Warming/Roe - Line
Gauss-Seidel (ASW/R-LGS)

•  Approximate Steger-Warming/Roe - ADI
(ASW/R-ADI)

•  Yoon-Jameson/Roe - LU-SSOR (YJ/R-LU-SSOR)

For  the  line  Gauss-Seidel  scheme,  four  different  sweep 
directions  are  possible,  namely  in  the  positive  and 
negative  i-  and  j-directions.  These  sweep  patterns 
are  indicated  in  Figure  10. 


Fig.  10:  Sweep  patterns  for  LGS 

Experiments  indicate  that,  for  the  single  block  case, 
fewer  iterations  are  needed  when  sweeping  in  the 
j-direction  (‘j-sweeps’)  than  when  sweeping  in  the  i- 
direction  (‘i-sweeps’).  This  is  to  be  expected,  since 
within  a  j-sweep,  240  cells  along  the  i-direction  are 
taken  implicitly,  while  within  an  i-sweep  only  19  cells 
along  the  j-direction  are  taken  implicitly.  The  higher 
implicitness  of  the  solver  for  j-sweeps  leads  to  faster 
convergence.  Sweeping  in  the  positive  or  negative 
direction  has  only  a  slight  influence  on  the  number 
of  iterations. 

Assume now that a multi-block approach is used,
with up to 16 blocks obtained by a 1D partitioning
of the grid, orthogonal to the airfoil (as in Fig. 11a).

The performance degradation of the multi-block
implementation of the schemes presented above is
shown in Table 7. As an initial guess, a first order
solution  computed  with  the  same  explicit  operator  as 
the  one  used  in  the  second  order  computation,  was 
employed.  The  convergence  criterion  was  a  reduction 
of  the  residual  by  a  factor  of  10“*.  The  first  order  and 
the  second  order  calculations  have  been  done  with 
respectively  CFL  =  30  and  CFL  =  4.  For  the  Line 
Gauss-Seidel  schemes,  j-sweeps  were  used.  The  com¬ 
bination  ASW/R-LGS  did  not  converge  for  CFL  = 
4  in  the  single  block  case  (a  decrease  of  the  CFL- 
number  was  necessary  for  convergence).  The  LU- 
SSOR  scheme  needs  more  iterations  than  the  other 
schemes;  but  one  LU-SSOR  iteration  is  considerably 
cheaper  than  an  LGS  or  ADI  iteration. 

The  results  in  Table  7  show  that  no  severe  degrada¬ 
tion  in  performance  occurs  for  any  of  the  schemes 
tested,  with  up  to  16  blocks.  Note  that  for  the 
Line  Gauss-Seidel  schemes,  a  stronger  degradation 
is  to  be  expected  when  j-sweeps  are  used  —  as  in 
the  tests  reported  here  —  than  when  i-sweeps  are 
used. Indeed, because of the 1D partitioning orthogonal
to the airfoil, the block boundaries are along
the j-direction. When j-sweeps are used, the ‘implicitness’
in the i-direction is cut by the block boundaries.
If i-sweeps had been used, the implicitness
(in the j-direction) would not have been affected
by the partitioning. Thus for a very large number
of blocks, i-sweeps will be more efficient, since line
Gauss-Seidel with j-sweeps degenerates to a point
Jacobi scheme, while with i-sweeps the scheme degenerates
to a line Jacobi scheme, which is still a
powerful scheme. However, for a moderate number
of blocks, j-sweeps are more efficient, since only 384
iterations are needed when j-sweeps (in the positive
j-direction) are used, compared to 561 iterations when
+i-sweeps are used.

Convergence  degradation  can  also  be  observed  when 
a  preconditioned  Krylov  subspace  iteration  is  used 
as  linear  system  solver  :  often  an  efficient  precondi¬ 
tioner  (e.g.  ILU)  is  replaced  in  the  parallel  code  by  a 
less  effective  preconditioner  (e.g.  diagonal  precondi¬ 
tioner),  that  can  be  parallelised  more  easily.  Also  in 
this  case,  the  pure  parallel  efficiency  and  speedup  are 
not  the  appropriate  measures  for  the  performance, 
and  the  different  convergence  properties  of  the  se¬ 
quential  and  parallel  solver  must  be  taken  into  ac¬ 
count.  Also  in  this  case,  a  multi-block  approach  can 
be  useful  [41]. 

4.3  Load balancing of block structured CFD codes

We  now  describe  some  results  on  the  ‘mapping’  of 
block  structured  grids  on  distributed  memory  sys¬ 
tems  and  the  obtained  parallel  performance.  We  re¬ 
fer  to  [37]  for  more  information. 

Starting from the grid for the NACA0012 testcase
with 16 blocks, each having the same number of cells,
we have created block structured grids with up to 103
blocks via grid refinement. Blocks have been refined
by doubling the number of grid lines in both directions,
using refinement criteria based on the streamwise
entropy gradient. Refined blocks are split into
four blocks. Thus all blocks contain approximately
the same number of cells. The resulting block structures
are shown in figure 11. The first grid counts 16
blocks, the second one 52 blocks and the third one
103 blocks.


scheme            1 block   2 blocks   4 blocks   8 blocks   16 blocks

first order
VL/VL-LGS           384       384        399        411        433
ASW/R-LGS           382       383        442        456        481

second order
VL/VL-LGS          1561      1580       1567       1577       1588
VL/VL-ADI          1332      1340       1300       1299       1322
ASW/R-LGS             *      1885       1897       1892       1919
ASW/R-ADI          1786      1782       1788       1788       1797
YJ/R-LU-SSOR       3338      3332       3533       3590       3670

Table 7: Number of iterations required to achieve convergence: influence
of the number of blocks (LGS-schemes: sweeps in the
j-direction).

A similar procedure has been used for a second
testcase: the computation of the supersonic flow
in a scramjet geometry. The inlet conditions are
M = 3.6, angle of attack of 0°, T0 = 300 K and
p0 = 100000 Pa. The first grid contains 8160 cells,
partitioned into 24 blocks, with sizes varying from
10 x 15 to 34 x 15. The block structure of the refined
grids with 24, 66, 132 and 261 blocks (corresponding
to 82152 cells) is shown in figure 12.

Since  the  grid  is  already  partitioned  into  blocks,  load 
balance  and  communication  minimisation  must  be 
achieved  by  an  appropriate  mapping  of  the  blocks 
onto  the  processors.  Various  mapping  strategies  are 
incorporated  into  a  software  library,  that  we  have 
developed  to  hide  most  of  the  parallel  implementa¬ 
tion  details  from  the  application  programmer  [42]. 
The  software  library  is  especially  designed  to  sup¬ 
port  run-time  load  balancing  for  applications  that 
use  adaptively  refined  grids,  see  [37]  [43]  [38].  This 
software  library  has  been  used  for  the  parallelisation 
of  the  multiblock  code  used  for  the  tests  described  in 
this section and in section 4.1. The mapping strategy
used for the test described here was based on
a recursive bisection technique using a cost function,
that  takes  into  account  the  calculation  cost  for  each 
block,  the  communication  between  blocks  mapped 
onto  different  processors,  and  the  machine  architec¬ 
ture  (network  topology). 
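
To make the mapping problem concrete, the sketch below maps blocks onto processors with a simple largest-block-first greedy heuristic. This is a deliberately simpler technique than the recursive bisection with a cost function described above: it balances the calculation cost only and ignores communication and network topology. The function names and the block costs are hypothetical.

```python
def greedy_mapping(block_costs, nprocs):
    """Map blocks to processors, largest block first, each onto the
    currently least loaded processor. Communication is ignored."""
    loads = [0.0] * nprocs
    mapping = {}
    order = sorted(range(len(block_costs)), key=lambda b: -block_costs[b])
    for block in order:
        proc = loads.index(min(loads))     # least loaded processor so far
        mapping[block] = proc
        loads[proc] += block_costs[block]
    return mapping, loads


def load_balance_factor(loads):
    """Average processor load divided by the maximum processor load."""
    return sum(loads) / (len(loads) * max(loads))


# Hypothetical calculation costs of six blocks, mapped onto two processors.
mapping, loads = greedy_mapping([5, 4, 3, 3, 2, 1], nprocs=2)
print(mapping, loads, load_balance_factor(loads))
```

A cost function that also penalises communication between blocks on different processors, as used by the authors, would replace the `min(loads)` choice in this sketch.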

Table  8  shows  the  effectivity  for  a  parallel  forward 
Euler  timestep  of  the  multi-block  code  on  an  Intel 
iPSC/860.  The  loss  of  effectivity  is  due  to  load  im¬ 
balance  and  communication  overhead.  The  achieved 
load  balance  is  reported  in  table  9.  The  correlation 
with  the  effectivity  in  table  8  reveals  that  load  im¬ 
balance  is  the  dominant  loss  factor. 

For  the  NACA0012  testcase,  all  16  initial  blocks  have 
the  same  size,  which  allows  a  perfect  load  balance  on 
up  to  16  processors.  Refinement  of  this  grid  leads 
to  52  and  103  blocks  of  almost  equal  size.  As  they 
cannot  be  equally  distributed  among  the  processors, 
some imbalance remains. For the scramjet testcase,
we start with 24 blocks of varying size. Table 9 shows
that  load  balancing  works  very  well  if  the  number  of 
blocks  is  much  larger  than  the  number  of  processors 
(or  if  the  block  sizes  are  well-chosen).  A  certain  vari¬ 
ation  in  block  sizes  is  beneficial  for  load  balancing. 
It  is  easily  verified  that  the  best  possible  load  bal¬ 
ance  that  can  be  obtained  with  blocks  of  equal  size 
is  worse  than  the  values  reported  in  Table  9. 
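
The verification mentioned above amounts to a one-line bound, sketched here in Python. It assumes that all blocks have exactly the same calculation cost, so that the most loaded processor holds the ceiling of (number of blocks / number of processors) blocks. The function name is hypothetical.

```python
import math


def best_balance_equal_blocks(nblocks, nprocs):
    """Best achievable load balance factor when nblocks blocks of equal
    cost are distributed over nprocs processors."""
    max_blocks_per_proc = math.ceil(nblocks / nprocs)
    return nblocks / (nprocs * max_blocks_per_proc)


for nblocks, nprocs in [(24, 16), (66, 32), (132, 64)]:
    bound = best_balance_equal_blocks(nblocks, nprocs)
    print(nblocks, nprocs, round(100 * bound, 1))
```

For 24 equal blocks on 16 processors the bound is 75 %, whereas Table 9 reports 94.3 % for the scramjet grid with its 24 blocks of varying size.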

Table  10  shows  the  communication  cost,  including 
the  overhead  of  the  message  preparation  (‘packing’ 
and  ‘unpacking’  of  the  information  in  buffers).  The 
communication  cost  does  not  grow  fast  with  the  num¬ 
ber  of  processors. 

Table  11  lists  the  estimated  parallel  efficiency.  It  was 
impossible  to  determine  the  true  parallel  efficiency 
and  speedup,  as  the  refined  grid  did  not  fit  in  a  single 
node’s  memory.  Therefore,  the  single  processor  time 
was  estimated  as  the  total  calculation  time  plus  the 
time  for  copying  the  data  to  or  from  the  communica¬ 
tion  buffer,  using  the  same  block  structure.  The  esti¬ 
mated  speedup  is  reported  in  table  12.  The  speedups 
obtained  are  high,  due  to  the  good  load  balance  and 
the  fact  that  the  flow  solver  is  so  compute-intensive. 

These  results  indicate  that  the  mapping  strategy 
computes  a  good  mapping  of  blocks  onto  processors. 
Note  that  even  for  block  structured  grids  of  moderate 
complexity,  it  is  very  difficult  or  even  impossible  to 
find  a  good  block  distribution  by  hand  and  an  auto¬ 
matic  procedure  is  needed.  Mapping  techniques  are 
closely  related  to  grid  partitioning  techniques  used  to 
partition  unstructured  grids  for  parallel  processing. 
A tutorial on grid partitioning techniques is given in
[44].


number of processors       1       2       4       8      16      32      64
NACA 16 blocks          99.5    98.8    97.8    96.0    94.4    47.1
NACA 52 blocks             /    98.5    97.0    90.7    81.1    78.2
NACA 103 blocks            /       /    95.9    94.6    86.9    78.9
Scramjet 24 blocks         /    98.7    97.6    97.3    94.3    50.1    24.8
Scramjet 66 blocks         /       /    95.9    92.9    87.5    83.4    58.1
Scramjet 132 blocks        /       /       /    94.5    93.2    88.3    82.0
Scramjet 261 blocks        /       /       /       /    91.2    90.2    80.6

Table 8: Effectivity α (%) on the iPSC/860


number of processors       1       2       4       8      16      32      64
NACA 16 blocks         100.0   100.0    98.8    97.7    95.9    48.0
NACA 52 blocks             /    99.7    98.4    92.7    82.8    80.4
NACA 103 blocks            /       /    98.3    97.7    89.6    81.4
Scramjet 24 blocks         /   100.0    99.8    99.7    94.3    51.0    25.5
Scramjet 66 blocks         /       /    98.6    96.2    90.8    87.0    60.6
Scramjet 132 blocks        /       /       /    98.3    97.2    92.5    87.8
Scramjet 261 blocks        /       /       /       /    95.9    94.8    87.8

Table 9: Calculation load balance λ (%) on the iPSC/860


number of processors       1       2       4       8      16      32      64
NACA 16 blocks           0.5     1.1     0.8     1.7     1.3     0.6
NACA 52 blocks             /     1.2     1.2     2.0     1.9     2.0
NACA 103 blocks            /       /     2.4     3.0     2.9     2.6
Scramjet 24 blocks         /     1.2     1.8     2.1     1.7     1.0     0.7
Scramjet 66 blocks         /       /     2.5     3.1     3.2     3.0     2.6
Scramjet 132 blocks        /       /       /     3.6     4.1     4.1     5.2
Scramjet 261 blocks        /       /       /       /     4.2     4.3     5.9

Table 10: Communication cost (%) on the iPSC/860


                        number of processors
                        1       2       4       8       16      32      64
NACA 16 blocks          100.0   99.5    98.4    96.4    95.4    47.4
NACA 52 blocks          /       99.3    97.8    91.4    81.7    79.0
NACA 103 blocks         /       /       97.0    95.5    87.8    79.6
Scramjet 24 blocks      /       99.5    98.8    98.1    95.3    50.3    25.1
Scramjet 66 blocks      /       /       97.0    94.0    88.4    84.4    59.0
Scramjet 132 blocks     /       /       /       95.7    94.4    89.3    83.8
Scramjet 261 blocks     /       /       /       /       92.4    91.3    82.2

Table 11: Estimated efficiency E (%) on the iPSC/860


                        number of processors
                        1       2       4       8       16      32      64
NACA 16 blocks          1       1.99    3.94    7.71    15.26   15.17
NACA 52 blocks          /       1.99    3.91    7.31    13.07   25.28
NACA 103 blocks         /       /       3.88    7.64    14.05   25.47
Scramjet 24 blocks      /       1.99    3.95    7.85    15.25   16.10   16.06
Scramjet 66 blocks      /       /       3.88    7.52    14.14   27.01   37.76
Scramjet 132 blocks     /       /       /       7.66    15.10   28.58   53.63
Scramjet 261 blocks     /       /       /       /       14.78   29.22   52.61

Table 12: Estimated speedup S on the iPSC/860
1  SUMMARY

Efficient use of a parallel computer requires the data and the
operations that must be performed on them to be distributed over the
processors in such a way that the work load is balanced and the
communication cost minimised. This distribution problem is called the
load balancing problem. For CFD applications, the load balancing
problem amounts to finding a partition of the grid and subsequently a
mapping of the subgrids to the processors that balance the work load
and minimise the communication costs. This tutorial describes
well-established methods for partitioning and mapping unstructured
grids. They range from simple heuristics, through global optimisation
methods, to very powerful and cost-effective algorithms that combine
the strengths of simpler heuristics. Most of the methods discussed
have been implemented in well-documented and well-supported
partitioning tools. The tutorial discusses two of the most important
ones: Chaco and TOP/DOMDEC.

2  INTRODUCTION 

Today's parallel computers potentially offer very high performance.
Achieving it in practice, however, requires a careful analysis of the
problem and of the solution methods, and often requires that the
characteristics of the parallel computer be taken into account during
the development of the parallel code.

More specifically, to obtain high performance on a parallel computer,
it is of paramount importance to distribute the data and the
operations that have to be performed on them in such a way that the
work load is balanced over the processors in the parallel computer,
while at the same time the communication cost is kept as small as
possible. We call this distribution problem the load balancing
problem.

In this tutorial, we will discuss the load balancing problem for
Computational Fluid Dynamics (CFD) applications. Most CFD calculations
are grid-oriented, i.e. the data are defined on a discrete grid of
points, finite volumes or finite elements, and the calculations
consist of applying certain operations on (the data associated with)
all the points, volumes or elements of the grid.

Grid-oriented applications are usually parallelised by partitioning
the grid and by distributing the subgrids among the processors of the
parallel computer. Each processor then performs the calculations on
its own grid points, volumes or elements. For grid-oriented problems,
the load balancing problem amounts to finding a partition of the grid
and subsequently a mapping of the subgrids to the processors that
balance the work load and minimise the communication costs.


2.1  Static  and  dynamic  load  balancing 

The grids that are used in Computational Fluid Dynamics can be
structured or unstructured, static or adaptive, and single level or
multilevel (the latter in case multigrid is used). If neither the grid
nor the amount of work associated with each grid point changes during
the calculations, the distribution of a grid-oriented application can
be done statically, usually as a pre-processing step on a sequential
computer. If the grid or the calculations do change, however, the grid
points must be redistributed dynamically over the processors of the
parallel machine to maintain load balance. This problem is much more
difficult than the static one for several reasons:

1. the performance of the parallel computer must be monitored to
detect the load imbalance,

2. a decision must be made as to whether the gain of the
redistribution will outweigh the cost of calculating the new
distribution and transferring the grid points; if this cost is very
high, it can in fact be advantageous to proceed with an unbalanced
distribution,

3. the new distribution must be calculated on the parallel computer,
which requires that the distribution algorithm is parallelised,

4. the execution time of the balancer is much more critical than in
the static case, because the new distribution is used for a shorter
time period,

5. the rebalancing algorithm must preferably find a distribution that
is similar to the current distribution, so that only a minimal number
of grid points must be transferred.
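
The trade-off in items 2 and 4 can be made concrete with a small
sketch. It is only an illustration under simple assumptions: the time
per step with and without rebalancing is taken to be known and
constant, and all names (should_rebalance, t_step_now, and so on) are
hypothetical and do not come from any of the tools discussed here.

```python
def should_rebalance(t_step_now, t_step_balanced, steps_remaining,
                     t_repartition, t_transfer):
    """Return True if redistributing the grid is expected to pay off.

    t_step_now      : time per step with the current (unbalanced) distribution
    t_step_balanced : estimated time per step after redistribution
    steps_remaining : number of steps for which the new distribution is used
    t_repartition   : time needed to compute the new distribution
    t_transfer      : time needed to move the grid points
    """
    gain = (t_step_now - t_step_balanced) * steps_remaining
    cost = t_repartition + t_transfer
    return gain > cost


# The same imbalance is worth repairing only if enough steps remain.
print(should_rebalance(1.2, 1.0, 100, 5.0, 10.0))
print(should_rebalance(1.2, 1.0, 50, 5.0, 10.0))
```

The shorter the period during which the new distribution is used, the
smaller the gain, which is why the execution time of the balancer
matters more than in the static case.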

When adaptive refinement is used in a CFD code, the grid remains fixed
during rather long periods. In this case one can invoke a load
balancer after each grid refinement. This type of load balancing is
called iterative static load balancing [1, 2] or quasi-dynamic load
balancing [3]. The techniques that will be discussed in this tutorial
are meant to be used for static load balancing. Nevertheless, many of
them are also useful for quasi-dynamic load balancing. More specific
algorithms, which explicitly try to take the cost of transferring grid
points into account, can be found in [4, 5, 6].

2.2  Partitioning  and  mapping 

While distributing the grid points of a structured grid among the
processors of a parallel computer is a straightforward task, doing the
same for an unstructured grid is very complex. The problem can be
alleviated by performing the distribution of the grid points among the
processors in two steps. First, the grid is partitioned into a number
of subgrids and subsequently these subgrids are mapped onto the
processors. Typically, the number of subgrids is chosen equal to the
number of processors. In principle, the partitioning only depends on
the characteristics of the problem while the mapping takes the
characteristics of the machine into account. Therefore, these two
separate problems are easier to solve than the original distribution
problem. On the other hand, solving the partitioning problem
separately from the mapping problem usually restricts the quality of
the distribution that can be obtained, because decisions made during
the partitioning step may inhibit finding a good mapping afterwards.

2.3  Requirements  for  partitioning 

In general, an algorithm based on grid partitioning or domain
decomposition involves interface operations and local computations.
The interface operations consist of communication between subdomains
and, in some cases, the solution of a true interface problem (i.e. a
Schur-complement operator) or the assembly of subgrid quantities at
their common interfaces. The local computations correspond either to
the solution of a local subproblem or simply to the explicit
evaluation of a subgrid quantity.

It is clear that in order to keep the global calculation time as small
as possible, the interface operations should take as little time as
possible. The local computations should also take as little time as
possible and should be balanced evenly among the processors. If this
is not the case, the processors will have to wait for the overloaded
processor(s) to catch up before they can start with the interface
calculations.

From  these  general  requirements,  we  can  deduce 
the  requirements  for  a  mesh  partitioner. 

1. The time taken by the interface operations is a function of the
number of points on the boundary of the subgrid. Very often it is also
a function of the number of adjacent subgrids. Therefore the length of
the boundary and the number of adjacent subgrids should be minimised
for each subgrid. These two requirements are very often conflicting,
and their relative importance depends on the problem and on the
characteristics of the parallel computer (especially the ratio of
start-up time to transfer time).

2. The time for the local calculations is a function of the number of
points in the subgrid. If the amount of work is the same for all the
grid points, each subgrid should have the same number of points to
balance the work load.

If the local and/or the interface calculations are implicit, i.e.
involve the solution of a system of equations, a number of additional
considerations come into play:

3. If the local calculations are implicit, the conditioning of the
local problem is (strongly) influenced by the aspect ratio of the
subgrid. It can be shown that subgrids with bad aspect ratios (i.e.
subgrids that are very elongated) generate local problems that are
poorly conditioned and are difficult to solve iteratively [7, 8].
Moreover, ill-conditioned local problems have a negative impact on the
iterative solution of the interface problem [9] as well. Elongated
subgrids tend to have long perimeters, therefore trying to obtain
interfaces with minimal length will typically yield grids with good
aspect ratios.

4. If the local calculations are implicit, and if a direct method is
used to solve the local system, the calculation time for each subgrid
is influenced by the bandwidth of the local matrix. This bandwidth
depends on the shape of the subgrid.

5. If a frontal method is used to solve the linear systems arising
from a finite element approach, the frontwidth associated with each
subgrid should not be greater than the frontwidth of the global grid.
Ideally, the partitioning should generate subgrids in which the number
of unknowns at the interface of any subgrid is smaller than the
frontwidth associated with the undecomposed grid and the frontwidth of
each subgrid is at most comparable to the frontwidth of the global
grid [8].


2.4  Requirements  for  mapping 

The mapping algorithm must assign the subgrids to the processors of a
parallel machine. Preferably, subgrids that are mutually dependent are
mapped onto processors that can communicate rapidly with each other.
For a fully connected machine, the mapping task is trivial: any
mapping is as good as any other. For the existing machines with a
limited communication topology (hypercube, 2D- or 3D-mesh, ...) this
is not the case. Although communication between arbitrary processors
can be done efficiently, nearest-neighbour communication is preferable
because it decreases the risk of communication link contention.

As mentioned earlier, the mapping task is not completely independent
of the partitioning task. The mapping task can be made considerably
easier by already taking the machine topology into account during the
partitioning step, to ensure that the dependency topology of the
subgrids matches the communication topology of the machine.


3  A  CLASSIFICATION  OF  PARTITIONING  ALGORITHMS  FOR  UNSTRUCTURED  GRIDS

3.1  General  optimisation  techniques  based  on  a  cost  function

The most general approach to finding an optimal distribution of the
grid points among the processors of a parallel machine is to model the
total calculation time as a function of the mapping. In this way one
obtains a function that associates a cost with each feasible
distribution. Thus it is possible to solve the partitioning and the
mapping problem together. In fact, the cost function can be quite
sophisticated, taking into account hardware characteristics and
communication topology of the parallel computer, contention of the
communication links, etc. Normally, the cost function contains a term
that takes the communication cost into account and another that is
related to the load imbalance. The relative importance of those two
terms depends on the characteristics of the problem and of the
parallel computer. Indeed it is sometimes beneficial to tolerate a
(slight) load imbalance if this decreases the communication.
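
As an illustration, a very simple cost function of this kind might
look as follows. This is a sketch under stated assumptions, not a
model taken from the literature cited here: every grid point costs the
same amount of work, every cut dependency transfers one value, and
each processor sends one message per neighbouring processor. The
parameter names (t_calc, t_startup, t_transfer) and their default
values are hypothetical.

```python
from collections import Counter


def distribution_cost(edges, owner, num_procs,
                      t_calc=1.0, t_startup=50.0, t_transfer=1.0):
    """Estimated time of one parallel step (arbitrary units).

    edges : pairs (i, j) of mutually dependent grid points
    owner : owner[i] is the processor that holds grid point i
    """
    # Load term: the most heavily loaded processor sets the pace.
    load = Counter(owner)
    calc_time = t_calc * max(load.get(p, 0) for p in range(num_procs))

    # Communication term: one message per neighbouring processor, whose
    # length is the number of cut dependencies towards that processor.
    volume = Counter()
    for i, j in edges:
        if owner[i] != owner[j]:
            volume[(owner[i], owner[j])] += 1
            volume[(owner[j], owner[i])] += 1
    comm_time = [0.0] * num_procs
    for (p, _q), n in volume.items():
        comm_time[p] += t_startup + t_transfer * n

    return calc_time + max(comm_time)


chain = [(0, 1), (1, 2), (2, 3)]            # four points in a row
print(distribution_cost(chain, [0, 0, 1, 1], 2))   # contiguous halves
print(distribution_cost(chain, [0, 1, 0, 1], 2))   # interleaved points
```

Both distributions are perfectly balanced, but the interleaved one
cuts more dependencies and is therefore more expensive under this
model. Changing the ratio of t_startup to t_transfer shifts the
relative weight of the number of neighbours and the interface length.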

The cost function can be minimised by a general optimisation technique
that is appropriate for global combinatorial optimisation. For a grid
with N points that must be mapped onto P processors, the search space
has cardinality $P^N$. Because the search space grows exponentially
in the grid size, total enumeration is infeasible for realistic
problems, even when one takes advantage of possible symmetry
properties or uses branch-and-bound techniques to exclude whole parts
of the search space.

However, some techniques that yield good sub-optimal solutions for
combinatorial optimisation problems do exist. Two of them, viz.
simulated annealing and genetic algorithms, are frequently used.

Simulated annealing. Simulated annealing [10, 11] is a very general
optimisation method which stochastically simulates the slow cooling of
a physical system. A parameter T, analogous to the temperature, is
slowly lowered in the course of the calculations. For each temperature
a number of transitions of the current solution are consecutively
proposed and either accepted or rejected according to the Metropolis
criterion: if the cost function decreases (cost increase
$\Delta C < 0$), the change is accepted unconditionally, otherwise it
is accepted with probability $\exp(-\Delta C / T)$. It can be proved
that under certain conditions the probability of finding the global
optimum tends to 1. In practice, for sufficiently slow cooling rates
this method produces good solutions, but then the method is very
expensive. Results of simulated annealing for grid partitioning can be
found in [12, 3].
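
The Metropolis criterion can be sketched for a small graph bisection
problem as follows. The sketch makes several assumptions that are not
part of the description above: the cost is simply the number of cut
edges, a transition swaps two points between the two parts (so that
the parts stay equally sized), and the temperature is lowered
geometrically. The function names and the parameters t_start, t_end,
cooling and moves_per_t are hypothetical.

```python
import math
import random


def cut_size(edges, side):
    """Number of edges whose end points lie in different parts."""
    return sum(1 for i, j in edges if side[i] != side[j])


def anneal_bisection(n, edges, t_start=2.0, t_end=0.01,
                     cooling=0.95, moves_per_t=200, seed=0):
    rng = random.Random(seed)
    side = [i % 2 for i in range(n)]        # balanced starting point
    rng.shuffle(side)
    cost = cut_size(edges, side)
    best, best_cost = side[:], cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            i, j = rng.randrange(n), rng.randrange(n)
            if side[i] == side[j]:
                continue                    # swap would change nothing
            side[i], side[j] = side[j], side[i]
            delta = cut_size(edges, side) - cost
            # Metropolis criterion: always accept an improvement,
            # accept a deterioration with probability exp(-delta / t).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                cost += delta
                if cost < best_cost:
                    best, best_cost = side[:], cost
            else:
                side[i], side[j] = side[j], side[i]   # undo the swap
        t *= cooling                        # slow cooling
    return best, best_cost


path = [(k, k + 1) for k in range(7)]       # eight points in a row
parts, cut = anneal_bisection(8, path)
print(parts, cut)
```

Recomputing the full cost for every proposed swap is wasteful; a
practical code would update the cost incrementally. Even so, the
sketch shows why the method is expensive: many transitions are needed
at every temperature.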

Genetic algorithms. Genetic algorithms [13, 14] resemble simulated
annealing in that they are also general and robust optimisation
methods that simulate an optimisation process found in nature. More
specifically, genetic algorithms simulate the processes of
reproduction, crossover and selection that make living beings
optimally adapted to their environment. Genetic algorithms are
potentially able to yield optimal or near-optimal solutions but take a
large amount of time. Results of genetic algorithms for partitioning
problems are reported in [15, 16, 17].

Modelling the execution time as a function of the distribution of the
grid points has the advantage that the partitioning and mapping
problem can be solved together, and that sophisticated cost models can
be used. However, stochastic optimisation algorithms are extremely
slow, can be trapped in local minima, and their behaviour depends on
many parameters that must be carefully tuned to optimise performance.

3.2  Specific  grid  partitioning  heuristics 

3.2.1  Introduction 

To make the distribution problem more tractable, one normally makes
the following simplifications:

1. The partitioning and the mapping problem are handled separately.
In what follows, we will restrict ourselves to the partitioning
problem.

2. Rather than trying to minimise both computational workload
imbalance and communication simultaneously, only one of the two terms
is explicitly modelled while the other is used implicitly in guiding
the search. In this way the search space can be substantially reduced.
Most often, one explicitly tries to minimise communication while the
heuristic implicitly provides equally-sized subgrids.

3.2.2  Clustering  techniques 

Some authors have proposed partitioning and mapping strategies based
on clustering techniques. In these approaches clusters of grid points
are formed with high intra-cluster dependencies and low inter-cluster
communication. The clustering is based on a sorting of the grid points
and subsequent partitioning.

Mapping algorithm of Sadayappan. Sadayappan et al. [18, 19] proposed a
nearest-neighbour mapping algorithm that proceeds in two steps:

1. An initial mapping is generated by grouping grid points in clusters
and assigning clusters to processors so that the nearest-neighbour
property is satisfied, i.e. neighbouring points are assigned either to
the same processor or to neighbouring processors.

2. The initial mapping is successively modified using a boundary
refinement procedure in which points are reassigned among the
processors in a manner that improves calculation load balance but
always maintains the nearest-neighbour property.

Thus the nearest-neighbour mapping scheme explicitly attempts to
minimise calculation load imbalance, while low communication costs are
achieved implicitly by the search strategy.

Bandwidth reduction algorithms. Algorithms that reduce the bandwidth
and the profile of a (sparse) matrix by re-ordering the equations and
the unknowns of the linear system can also be used for partitioning
meshes [8, 20].

For a given numbering of the n elements of a mesh, we can associate an
adjacency matrix A, which is a symmetric $n \times n$ matrix with
elements $a_{ij}$ that are equal to either 1 or 0 according to whether
the elements i and j are or are not adjacent in the mesh. Let $m_i$
($i = 1, \ldots, n$) be the smallest number for which $a_{ij} = 0$ if
$|i - j| > m_i$. The bandwidth of A is then defined as $\max_i m_i$,
and the profile as $\sum_{i=1}^{n} m_i$.

If the elements of the mesh have been numbered in such a way that the
adjacency matrix has a small profile and bandwidth, a lexicographic
partitioning of the mesh will often place adjacent elements in the
same subgrid, and each subgrid will only have a limited number of
neighbouring subgrids. Figure 1 illustrates this. Notice that two
adjacent elements are assigned to different subgrids if the
corresponding element in the adjacency matrix is not in one of the
blocks on the main block diagonal, and that two subgrids are adjacent
if the corresponding off-diagonal block is non-zero.
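
The following sketch computes the bandwidth and the profile for a
given numbering and performs a lexicographic partitioning. It assumes
0-based element numbers, takes the profile to be the sum of the $m_i$,
and stores the adjacency matrix as a list of adjacent pairs rather
than as a full matrix; the function names are hypothetical.

```python
def row_half_widths(n, adjacent):
    """m[i] = largest |i - j| over all elements j adjacent to i."""
    m = [0] * n
    for i, j in adjacent:
        d = abs(i - j)
        m[i] = max(m[i], d)
        m[j] = max(m[j], d)
    return m


def bandwidth(n, adjacent):
    return max(row_half_widths(n, adjacent))


def profile(n, adjacent):
    return sum(row_half_widths(n, adjacent))


def lexicographic_partition(n, parts):
    """Cut the numbering 0..n-1 into consecutive chunks."""
    return [i * parts // n for i in range(n)]


def adjacent_subgrids(adjacent, part):
    """Pairs of subgrids that own at least one pair of adjacent elements."""
    return {(min(part[i], part[j]), max(part[i], part[j]))
            for i, j in adjacent if part[i] != part[j]}


# A strip of six elements, numbered along the strip and in a scrambled way.
natural = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
scrambled = [(0, 5), (5, 1), (1, 4), (4, 2), (2, 3)]
part = lexicographic_partition(6, 3)
for numbering in (natural, scrambled):
    print(bandwidth(6, numbering), profile(6, numbering),
          sorted(adjacent_subgrids(numbering, part)))
```

With the natural numbering each subgrid has at most two neighbours;
with the scrambled numbering every subgrid touches every other one.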

Fig. 1: Mesh partitioned into 4 submeshes and the corresponding
adjacency matrix

Fig. 2: Mesh in Fig. 1, numbered with the Cuthill-McKee heuristic and
the corresponding adjacency matrix

The Reverse Cuthill-McKee (RCM) ordering scheme [21] is one of the
most popular techniques for reducing the bandwidth and the profile of
sparse matrices. The Cuthill-McKee scheme, applied to the adjacency
matrix of a mesh, essentially clusters the elements in level sets:

    choose an initial element ;
    i := 1 ;
    S_1 := {initial element} ;
    while not all elements have been added to a level set do
        S_{i+1} := {} ;
        forall elements e_k in S_i, in the order that they have
        been added to S_i, do
            add to S_{i+1} the adjacent elements of e_k that have
            not yet been added to a level set ;
        endfor ;
        i := i + 1 ;
    endwhile ;

Next, the elements are numbered in the order in which they have been
added to the level sets.

For the mesh in Fig. 1, the Cuthill-McKee algorithm, initiated with
the upper left element, creates the level sets $S_1 = \{1\}$,
$S_2 = \{3\}$, $S_3 = \{2, 5\}$, $S_4 = \{4, 6, 7\}$,
$S_5 = \{8, 9\}$, $S_6 = \{10, 11\}$, and $S_7 = \{12\}$. The
resulting numbering, partitioning, and adjacency matrix are shown in
Fig. 2.

Usually, the order obtained with the Cuthill-McKee algorithm is
reversed. This does not affect the bandwidth of the matrix but it
often decreases its profile.
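
The level-set construction translates almost directly into code. The
sketch below follows the simplified scheme as described here, i.e.
without the sorting of neighbours by degree that is found in many
library implementations; it assumes a connected mesh given as a
dictionary that maps each element to the list of its adjacent
elements. The function names are hypothetical.

```python
def cuthill_mckee_levels(adj, start):
    """Cluster the elements in level sets, starting from `start`."""
    levels = [[start]]
    seen = {start}
    while len(seen) < len(adj):
        next_level = []
        for e in levels[-1]:                # in the order they were added
            for nb in adj[e]:
                if nb not in seen:
                    seen.add(nb)
                    next_level.append(nb)
        if not next_level:
            raise ValueError("mesh is not connected")
        levels.append(next_level)
    return levels


def reverse_cuthill_mckee(adj, start):
    """New ordering: level-set order, reversed."""
    order = [e for level in cuthill_mckee_levels(adj, start) for e in level]
    return order[::-1]


# A 2 x 3 block of quadrilateral elements:   0 1 2
#                                            3 4 5
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
       3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
print(cuthill_mckee_levels(adj, 0))
print(reverse_cuthill_mckee(adj, 0))
```

The position of an element in the returned list is its new number; a
lexicographic partitioning then simply cuts this list into consecutive
pieces.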

Bandwidth minimiser algorithms have the advantage that for each
subdomain, the number of adjacent subdomains is small. Therefore, each
processor must only send messages to a small number of neighbours.
This can be important if the start-up cost of sending a message is
high. However, bandwidth minimiser algorithms have a tendency to
generate very elongated subgrids with rather large interface sizes.
Usually these subdomains enjoy a very small local bandwidth, but
suffer from a very bad aspect ratio. These problems are alleviated if
the RCM algorithm is used recursively [8].


Greedy heuristic of Farhat. For the partitioning of finite element
meshes, Farhat [22] proposed a greedy algorithm that uses only
connectivity information. A variation of the algorithm that also uses
geometrical information was proposed by Al-Nasra and Nguyen [23]. We
will discuss Farhat's heuristic in more detail in Section 4.


3.2.3  Geometry-based  techniques 

In the geometry-based techniques the partitioning is based on
geometrical information about the grid points (i.e. their
coordinates). This is sensible because, in most problems,
interdependent grid points are geometrically adjacent. Geometry-based
techniques are typically cheap methods that are nevertheless able to
produce acceptable partitionings. They are dealt with in Section 5.
3.2.4  Heuristics  for  graph  partitioning

The graph partitioning problem can be formulated as follows. An
undirected graph $G = (V, E)$ with vertex set V and edge set E is
given. Often, weights $w_e(e_{ij})$ are attributed to the edges
$e_{ij} \in E$. Also given are P positive integers
$n_1, \ldots, n_P$, satisfying $\sum_{p=1}^{P} n_p = n = |V|$. The
problem is then to partition the vertex set V into P disjoint subsets
$V_1, \ldots, V_P$ of sizes $n_1, \ldots, n_P$, respectively, in such
a way that the sum of the weights of edges connecting different
subsets is minimal. In most cases we require that
$n_1 = n_2 = \cdots = n_P$.

An edge which connects two distinct subsets is said to be cut by the
partition. The graph partitioning problem can also be generalised to
the case that the vertices $v_i$ too have a weight $w_v(v_i)$. In this
case, the weights of the subsets are imposed instead of their sizes.

Now, for grid-oriented problems, a dependency graph can be defined as
follows:

1. Each grid point has a corresponding vertex in the graph. The weight
of the vertex is proportional to the amount of computational work that
is involved with the grid point. Very often the weights of all
vertices are equal, e.g. when iterative solution schemes are used.

2. For each pair of mutually dependent grid points, the corresponding
vertices are connected by an edge in the graph. The weight of the edge
is proportional to the strength of the interdependency
('communication volume').

Fig. 3: Finite element mesh and corresponding dual graph

For a finite element mesh, two elements are usually dependent on each
other if they share an edge in two dimensions or a face in three
dimensions. Therefore, the interdependency graph is simply the dual
graph of the mesh. An example is given in Fig. 3.
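
For a two-dimensional mesh, the dual graph and the number of edges cut
by a partition can be computed as in the following sketch. It assumes
that every element is given as a tuple of node numbers listed in
order around the element, and that all edge weights are equal to one;
the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def dual_graph(elements):
    """Edges (p, q) of the dual graph: elements that share a mesh edge."""
    by_edge = defaultdict(list)
    for k, nodes in enumerate(elements):
        for a in range(len(nodes)):
            mesh_edge = frozenset((nodes[a], nodes[(a + 1) % len(nodes)]))
            by_edge[mesh_edge].append(k)
    dual = set()
    for sharing in by_edge.values():
        for p, q in combinations(sharing, 2):
            dual.add((min(p, q), max(p, q)))
    return sorted(dual)


def edge_cut(dual_edges, part):
    """Number of dual edges that connect different subsets."""
    return sum(1 for p, q in dual_edges if part[p] != part[q])


# Three triangles that fan out from node 0.
triangles = [(0, 1, 2), (0, 2, 3), (0, 3, 4)]
dual = dual_graph(triangles)
print(dual)
print(edge_cut(dual, [0, 0, 1]))
```

In three dimensions the same idea applies with element faces in place
of mesh edges.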

Obviously, the grid partitioning problem is equivalent to the graph
partitioning problem for the dependency graph. The graph partitioning
problem is NP-complete, but a number of specific heuristics that yield
good near-optimal solutions do exist. We will discuss the
Kernighan-Lin heuristic, the Recursive Graph Bisection algorithm, and
the Recursive Spectral Bisection algorithm.

Kernighan-Lin heuristic. Already in 1970, Kernighan and Lin introduced
a heuristic to partition a graph into two or more subgraphs [24].
Their heuristic only partitions graphs without vertex weights, but
generalisation to graphs with unequal vertex weights is
straightforward. Fiduccia and Mattheyses [25] demonstrated that, if
the vertex weights are small integer numbers, the algorithm can be
organised in such a way that its complexity is only linear in the
number of edges.

Recursive Graph Bisection algorithm. The Recursive Graph Bisection
algorithm (RGB) recursively determines two vertices of maximal or near
maximal distance in the (sub)graph, and subsequently assigns the
vertices to one or to the other subset, according to whether they are
closer to one or to the other extremal vertex. To determine the
distance between two vertices, the graph distance is used, i.e. the
length of the shortest path between the vertices. A more thorough
discussion of this algorithm and a comparison with other partitioning
algorithms can be found in [26, 27].
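
A minimal sketch of this idea is given below. It departs from the
description in two labelled ways. First, the two extremal vertices are
found with two breadth-first searches, which gives a pair at near
maximal rather than provably maximal distance. Second, to obtain two
subsets of equal size, the vertices are ranked by the difference of
their distances to the two extremal vertices and the ranking is cut in
the middle, rather than comparing the two distances directly. All
function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source, allowed):
    """Graph distance from `source` to every reachable vertex in `allowed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w in allowed and w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def graph_bisection(adj, vertices):
    allowed = set(vertices)
    d0 = bfs_distances(adj, min(allowed), allowed)
    a = max(d0, key=d0.get)                 # far away from the start vertex
    da = bfs_distances(adj, a, allowed)
    b = max(da, key=da.get)                 # far away from a
    db = bfs_distances(adj, b, allowed)
    far = len(allowed)                      # stands in for "unreachable"
    ranked = sorted(allowed,
                    key=lambda v: (da.get(v, far) - db.get(v, far), v))
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]     # closer to a, closer to b


def recursive_graph_bisection(adj, vertices, depth):
    """Partition `vertices` into 2**depth subsets."""
    if depth == 0 or len(vertices) < 2:
        return [sorted(vertices)]
    left, right = graph_bisection(adj, vertices)
    return (recursive_graph_bisection(adj, left, depth - 1)
            + recursive_graph_bisection(adj, right, depth - 1))


# Eight vertices in a row.
adj = {v: [w for w in (v - 1, v + 1) if 0 <= w < 8] for v in range(8)}
print(sorted(recursive_graph_bisection(adj, list(range(8)), 2)))
```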

Recursive  Spectral  Bisection  algorithm. 
The  Recursive  Spectral  Bisection  algorithm 
(RSB)  is  based  upon  results  from  spectral  graph 
theory,  in  which  eigenvectors  of  a  matrix  are  used 
to  bisect  a  graph.  This  algorithm  is  discussed  in 
detail  in  Section  6. 


4  THE  GREEDY  HEURISTIC  OF  FARHAT

4.1  Description 

The greedy algorithm of Farhat [22] is a heuristic that despite its
simplicity often yields subgrids with short boundaries and good aspect
ratios. The algorithm first assigns to each node $n_i$ of the mesh a
weight $w_i$ that is equal to the number of elements that are
connected to it. Let $\Omega^s$, $\Gamma^s$ and $C^s$ respectively
denote the body, the interface boundary, and the computational cost of
a subdomain s, and let C denote the computational cost of the whole
domain. The algorithm consecutively finds the domains
$\Omega^1, \ldots, \Omega^P$. Once the first $s - 1$ domains
$\Omega^1, \ldots, \Omega^{s-1}$ have been found, it constructs the
next domain $\Omega^s$ as follows:

    locate a node n_i in Gamma^{s-1} that has a minimal current
    weight w_i ;
    initialise Omega^s with all un-masked elements that are
    connected to node n_i ;
    for each element e_k in Omega^s do recursively
        mask e_k ;
        for each node n_j attached to e_k do
            reduce the weight w_j by one ;
        endfor ;
        add to Omega^s all un-masked elements that are adjacent
        to e_k ;
        update C^s ;
        break when C^s = C/P.
    endfor ;

Figure 4 illustrates how the algorithm expands a subdomain starting
from the lower left element.

Fig. 4: Expansion of a subdomain using the greedy algorithm (after
Farhat [22]).

The greedy heuristic of Farhat is probably the fastest partitioning
algorithm. Since it can immediately partition a grid into the desired
number of subgrids, it is not necessary to use it recursively. This
has the advantage that the calculation time is essentially independent
of the desired number of subgrids. In general this algorithm generates
subgrids with good aspect ratios, but it often yields disconnected
subgrids.
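
The following sketch shows one possible reading of the greedy
expansion. It is not Farhat's original code and simplifies several
points, all of which are assumptions: every element has unit cost, two
elements count as adjacent when they share a node, each subdomain
receives an equal share of the elements that are still un-masked, and
when the expansion front runs empty the subdomain is continued from an
arbitrary un-masked element (which is how a disconnected subgrid
arises). The function name is hypothetical.

```python
from collections import defaultdict


def greedy_partition(elements, num_parts):
    """elements: list of node tuples; returns the subdomain of each element."""
    node_elems = defaultdict(list)
    for k, nodes in enumerate(elements):
        for n in nodes:
            node_elems[n].append(k)
    # Weight of a node = number of un-masked elements connected to it.
    weight = {n: len(es) for n, es in node_elems.items()}
    part = [-1] * len(elements)             # -1 means "un-masked"
    interface = set()                       # interface nodes of the last subdomain
    remaining = len(elements)

    for s in range(num_parts):
        if remaining == 0:
            break
        target = -(-remaining // (num_parts - s))      # ceiling division
        candidates = ([n for n in interface if weight[n] > 0]
                      or [n for n in weight if weight[n] > 0])
        seed = min(candidates, key=lambda n: (weight[n], n))
        front = [k for k in node_elems[seed] if part[k] == -1]
        queued = set(front)
        pos = size = 0
        while size < target:
            if pos == len(front):           # front exhausted: jump elsewhere
                k = next(e for e in range(len(elements)) if part[e] == -1)
                front.append(k)
                queued.add(k)
            k = front[pos]
            pos += 1
            part[k] = s                     # mask the element
            size += 1
            for n in elements[k]:
                weight[n] -= 1
                for nb in node_elems[n]:
                    if part[nb] == -1 and nb not in queued:
                        queued.add(nb)
                        front.append(nb)
        remaining -= size
        interface = {n for k in front[:pos] for n in elements[k]
                     if weight[n] > 0}
    return part


# A strip of six quadrilaterals: top nodes 0..6, bottom nodes 7..13.
strip = [(k, k + 1, k + 8, k + 7) for k in range(6)]
print(greedy_partition(strip, 3))
```

Each element is visited once per subdomain at most, which is
consistent with the observation that the cost of the heuristic hardly
depends on the number of subgrids.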


4.2  Some  examples 

Figure 5 shows a two-dimensional finite element mesh with 9000
triangular elements and with 13278 internal boundary edges round an
airfoil. Only the part of the mesh that lies in the vicinity of the
airfoil, and which is strongly refined, is shown. Partitioning this
mesh into eight subgrids with the greedy algorithm of Farhat yields
the partitioning in Fig. 6. This partition cuts 355 edges. Notice that
the subgrids have a good aspect ratio. However, the subgrid that was
created last (darkly shaded in Fig. 6) is disconnected into three
parts.

Figure 7 shows the partition into eight subgrids of the
two-dimensional RYMAMO model [28]. This model is a finite difference
grid with 18675 points that covers the mouth of the Rhine and the
Meuse and the coastal zone near the harbour of Rotterdam. The greedy
algorithm was actually applied to a finite element mesh with
quadrilateral elements, constructed so that each element of this mesh
corresponds to a point in the finite difference grid and so that the
connectivities were preserved. It is very difficult to partition this
mesh into connected subgrids, and in the partition that one obtains
with the greedy algorithm, four subgrids out of eight are in fact
disconnected. The partition yields 22 connected parts and cuts 365
edges.


5  GEOMETRY  BASED  BISECTION 
ALGORITHMS 

5.1  Introduction 

In  the  geometry-based  bisection  algorithms,  one 
tries  to  exploit  the  geometric  properties  of  the 
mesh,  since  data  dependent  grid  points  are  geo¬ 
metrically  adjacent.  Clearly,  this  limits  the  ap¬ 
plicability  of  this  type  of  methods  to  problems 
where  such  geometric  information  is  both  mean¬ 
ingful  and  available. 

Based on the geometrical information, a scalar
quantity $\sigma_i$ is associated with each grid point.
Following Williams [3], we call $\sigma_i$ a separator
field. By evaluating the median $S$ of the set $\{\sigma_i\}$,
we can bisect the grid, according to whether $\sigma_i$
is greater or less than $S$. In this way two subgrids
with an equal number of grid points are
created. By recursively applying this strategy
to the subgrids, the grid can be partitioned into
$2^d$, $d = 1, 2, \ldots$ subgrids.

Notice  that,  based  on  the  ordering  of  the  separ¬ 
ator  field,  a  grid  could  easily  be  partitioned  into 
more  than  two  parts  at  once.  However,  the  as¬ 
pect  ratio  of  the  subgrids  is  usually  better  if  a 
grid  is  only  partitioned  into  two  parts. 


Fig. 5: Part of a finite element mesh with 9000 elements round
an airfoil.

Fig. 7: Partitioning into 8 subgrids of the RYMAMO grid with
the greedy heuristic of Farhat.

5.2 Repeated x-bisection

The simplest choice for $\sigma_i$ is $\sigma_i = x_i$, with $x_i$ the
x-coordinate  of  the  grid  point.  Recursive  applic¬ 
ation  of  this  technique  on  the  subgrids  gives  rise 
to  a  stripwise  partitioning,  with  strips  parallel 
to  the  y-axis.  Such  a  partitioning  causes  long 
inter-subgrid  interfaces  and  thus  a  large  commu¬ 
nication  volume. 
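
The separator-field strategy can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the names `median_split` and `recursive_bisection` and the small 8 x 4 lattice are hypothetical, and ties at the median are simply broken by sort order. With the x-coordinate as separator field it reproduces the stripwise partitioning described above.

```python
def median_split(points, separator):
    """Split point indices into two halves around the median separator value."""
    order = sorted(range(len(points)), key=lambda i: separator(points[i]))
    half = len(order) // 2
    return order[:half], order[half:]


def recursive_bisection(points, separator, depth):
    """Partition point indices into 2**depth parts by repeated median bisection."""
    parts = [list(range(len(points)))]
    for _ in range(depth):
        new_parts = []
        for part in parts:
            sub = [points[i] for i in part]
            low, high = median_split(sub, separator)
            # Map the local indices of the subgrid back to global indices.
            new_parts.append([part[i] for i in low])
            new_parts.append([part[i] for i in high])
        parts = new_parts
    return parts


# Repeated x-bisection of an 8 x 4 lattice of points into 4 vertical strips.
grid = [(x, y) for x in range(8) for y in range(4)]
strips = recursive_bisection(grid, separator=lambda p: p[0], depth=2)
```

Each of the four strips holds eight points and spans the full height of the lattice, which is exactly the situation that leads to long interfaces.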

5.3  Recursive  Coordinate  Bisection 

Recursive Coordinate Bisection (RCB) [27], also
called Orthogonal Recursive Bisection (ORB) [3],
consists of alternately bisecting the grid according
to the x- and y-coordinate and, for
three-dimensional grids, the z-coordinate. This technique
leads to grids with a better aspect ratio
than the ones that are obtained with repeated
x-bisection. This has a positive effect on the
communication volume.
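
A minimal sketch of this alternating scheme, assuming points are given as coordinate tuples, could look as follows. The function name `recursive_coordinate_bisection` and the 8 x 8 test lattice are illustrative assumptions; the cut direction simply cycles through the coordinates at each recursion level.

```python
def recursive_coordinate_bisection(points, depth, axis=0):
    """Return 2**depth lists of points, cutting alternately along each coordinate."""
    if depth == 0:
        return [points]
    dim = len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])   # separator field = coordinate
    half = len(ordered) // 2
    next_axis = (axis + 1) % dim                      # x, then y, (then z), then x ...
    return (recursive_coordinate_bisection(ordered[:half], depth - 1, next_axis)
            + recursive_coordinate_bisection(ordered[half:], depth - 1, next_axis))


# An 8 x 8 lattice is split into four 4 x 4 blocks instead of four 2 x 8 strips.
grid = [(x, y) for x in range(8) for y in range(8)]
parts = recursive_coordinate_bisection(grid, depth=2)
```

For this lattice the four parts are square blocks, which illustrates the better aspect ratio compared with repeated x-bisection.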

5.4  Recursive  Inertial  Bisection 

Using the x-, y- or z-coordinate of the grid points
has the disadvantage that the partitioning depends
on the coordinate system used, which is
not an intrinsic problem characteristic.


Fig. 8: Angular momentum of a discrete point set.

The basic idea behind the inertial bisection
strategy is the following. The principal inertia
direction of an object (or a discrete point set) is
the direction for which the rotational inertial
momentum $I = \sum_i r_i^2$, with $r_i$ the distance of
point $i$ to the rotation axis, is minimal when this
direction is taken as the rotation axis (see Fig. 8).
If the domain is more or less convex-shaped, the
minimal momentum axis will be aligned with the
overall shape of the grid. We can therefore expect
that the grid will have its smallest spatial extent
in the direction orthogonal to this axis of rotation.
This direction is then heuristically chosen
to be the bisection direction.

The inertia directions of the mesh are the eigenvectors
$I_1$, $I_2$ and $I_3$ corresponding to the eigenvalues
$\lambda_1 \le \lambda_2 \le \lambda_3$ of the $3 \times 3$ inertia matrix:

$$I = \begin{bmatrix} I_{xx} & I_{xy} & I_{xz} \\ I_{yx} & I_{yy} & I_{yz} \\ I_{zx} & I_{zy} & I_{zz} \end{bmatrix}$$

with

$$I_{xx} = \sum_i \left[ (y_i - y_c)^2 + (z_i - z_c)^2 \right],$$

$$I_{yy} = \sum_i \left[ (x_i - x_c)^2 + (z_i - z_c)^2 \right],$$

$$I_{zz} = \sum_i \left[ (x_i - x_c)^2 + (y_i - y_c)^2 \right],$$

$$I_{xy} = I_{yx} = -\sum_i (x_i - x_c)(y_i - y_c),$$

$$I_{yz} = I_{zy} = -\sum_i (y_i - y_c)(z_i - z_c),$$

$$I_{zx} = I_{xz} = -\sum_i (z_i - z_c)(x_i - x_c),$$

where the summations must be taken over all the
grid points and where $(x_i, y_i, z_i)$ and $(x_c, y_c, z_c)$
respectively denote the coordinates of the grid
points and the coordinates of the center of gravity
of the mesh. The eigenvector $I_1$ which is associated
with the smallest eigenvalue corresponds to
the axis of minimal angular momentum. Once $I_1$
is determined, the grid points are projected (orthogonal
projection) onto it and this projection
is used as the separator field $\sigma_i$ for the bisection
of the grid.
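
As a hedged illustration, one inertial bisection step might be sketched with NumPy as below. The function name `inertial_bisection` and the sample points are assumptions made for this example; unit masses are assumed for all grid points, and the inertia matrix is assembled in the compact form $\sum_i (|d_i|^2 \, \mathrm{Id} - d_i d_i^T)$, which has the same entries as the matrix above.

```python
import numpy as np


def inertial_bisection(points):
    """Bisect a point set at the median of its projection on the minimal-inertia axis."""
    pts = np.asarray(points, dtype=float)
    d = pts - pts.mean(axis=0)          # coordinates relative to the centre of gravity
    # Inertia matrix of unit point masses: sum_i (|d_i|^2 * Id - d_i d_i^T).
    inertia = np.sum(d * d) * np.eye(pts.shape[1]) - d.T @ d
    eigenvalues, eigenvectors = np.linalg.eigh(inertia)   # ascending eigenvalues
    axis = eigenvectors[:, 0]           # eigenvector of the smallest eigenvalue
    sigma = d @ axis                    # separator field: projection on the axis
    order = np.argsort(sigma, kind="stable")
    half = len(pts) // 2
    return axis, order[:half], order[half:]


# Points scattered along the diagonal x = y: the axis should follow that diagonal.
points = [(t, t, 0.1 * (t % 2)) for t in range(10)]
axis, low, high = inertial_bisection(points)
```

Because the sign of an eigenvector is arbitrary, either half may come out as `low`; only the split itself is meaningful.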

The Recursive Inertial Bisection (RIB) algorithm,
also called the Inertial Recursive Bisection
(IRB) [29] or the Recursive Principal Inertia
(RPI) [26] algorithm, is more expensive than repeated
x-bisection or Recursive Coordinate Bisection
but generally gives much better results.
Because the minimal rotational momentum axis
is an inherent property of the grid, this partitioning
does not depend on the orientation of the
coordinate system. It still depends, however, on
the relative scaling of the x-, y- and z-axes.

Recursive Inertial Bisection is now a widely used
partitioning technique [8, 30], especially in combination
with the Kernighan-Lin heuristic (see
Section 3.2.4).

5.5  Some  examples 

We will first make a comparison between the results
obtained with repeated x-bisection, Recursive
Coordinate Bisection, and Recursive Inertial
Bisection. These methods have been thoroughly
studied and compared with each other in [29].
The following examples have been taken from it.

Fig. 9: Narrowing curved channel (structured grid) and
partitions with geometry-based methods.

Figure 9 shows a structured finite volume mesh
for a narrowing curved channel. It consists of
768 cells and 1472 edges. It is partitioned into
16 parts, using the repeated x-bisection, Recursive
Coordinate Bisection and Recursive Inertial
Bisection heuristics. The load balance is in all
cases (nearly) perfect, but the number of edges
cut by the partition interfaces is respectively 324,
236 and 191. Notice that for this structured grid
of $48 \times 16$ cells, an optimal partitioning can easily
be found by splitting it into $8 \times 2$ nearly square
subgrids of $6 \times 8$ cells each. In this case only 160
edges are cut by the partition interfaces.
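
The edge counts quoted for the structured grid can be checked with a small counting script. This is only a sketch on the logical $48 \times 16$ index space; the function name `block_partition_cut` is an assumption, and the strip count it produces applies to strips in index space, not to the x-coordinate strips on the curved physical channel of [29].

```python
def block_partition_cut(nx, ny, px, py):
    """Count the edges of an nx x ny cell grid cut by a px x py block partition."""
    def owner(i, j):
        # Block indices of cell (i, j) when the grid is split into px x py blocks.
        return (i * px // nx, j * py // ny)

    total = cut = 0
    for i in range(nx):
        for j in range(ny):
            for ni, nj in ((i + 1, j), (i, j + 1)):   # right and upper neighbour
                if ni < nx and nj < ny:
                    total += 1
                    cut += owner(i, j) != owner(ni, nj)
    return total, cut


print(block_partition_cut(48, 16, 8, 2))    # (1472, 160): 8 x 2 blocks of 6 x 8 cells
print(block_partition_cut(48, 16, 16, 1))   # (1472, 240): 16 strips of 3 x 16 cells
```

The first call reproduces the 1472 edges and the 160 cut edges mentioned in the text.
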
However, for unstructured meshes like the one in
Fig. 10, such an optimal partitioning cannot be
found so easily. This mesh is used to calculate the
supersonic flow through a channel with a forward
step. It consists of 1186 cells and 1652 edges. It
is again partitioned into 16 parts. The number
of edges cut by the partition interfaces for the
repeated x-bisection, Recursive Coordinate Bisection,
and Recursive Inertial Bisection heuristics
is respectively 430, 297 and 281. Examination
of the figures reveals how in the Recursive Inertial
Bisection method the axis direction adapts itself
to the non-convexity of the narrowing curved
channel and to the increased mesh density in the
channel. The above experiments illustrate that
the Recursive Inertial Bisection heuristic should
be preferred over the other two geometry-based
techniques.

Let  us  now  compare  the  Recursive  Inertial  Bi¬ 
section  algorithm  with  Farhat’s  greedy  heuristic. 
Figure  11  shows  the  partitioning  into  eight  sub¬ 
grids  that  one  obtains  with  the  Recursive  Inertial 
Bisection  algorithm  of  the  grid  in  Fig.  5.  This 
partition  cuts  515  edges,  which  is  more  than  with 
Farhat’s  greedy  heuristic.  Also  notice  that  some 
subgrids  are  quite  elongated,  which  might  be  a 
problem  with  some  iterative  methods  [7]. 

Inertial  bisection  implicitly  assumes  that  the 
mesh  is  convex.  For  the  RYMAMO  model,  this 
is  obviously  not  the  case.  We  can  therefore  ex¬ 
pect  the  Recursive  Inertial  Bisection  algorithm  to 
perform  poorly.  Figure  12  shows  the  partitioning 
into eight subgrids. This partition yields 20 connected
parts, fewer than Farhat's greedy heuristic,
but it cuts 485 edges, which is more than the 365
edges that are cut by the greedy heuristic.

6 THE RECURSIVE SPECTRAL BISECTION ALGORITHM

6.1  Introduction 

The use of spectral methods to bisect graphs was
first considered by Donath and Hoffman [31], and
since then, spectral methods for computing various
graph parameters have been used by several
others.

Fig. 10: Channel with forward step (unstructured grid) and
partitions with geometry-based methods (panels: repeated
x-bisection, Recursive Coordinate Bisection, Recursive
Inertial Bisection).

Fig. 11: Partitioning into 8 subgrids of the finite element grid in
Fig. 5 with the Recursive Inertial Bisection algorithm.

Fig. 12: Partitioning into 8 subgrids of the RYMAMO grid with
the Recursive Inertial Bisection algorithm.

Barnes  [32]  introduced  a  bisection  technique  that 
uses  the  eigenvectors  corresponding  to  the  largest 
two  eigenvalues  of  the  adjacency  matrix  of  the 
graph. 

The  most  frequently  used  spectral  bisection  tech¬ 
nique  was  introduced  by  Pothen  et  al.  [33].  In 
this  method,  the  graph  is  bisected  according  to 
the  eigenvector  that  corresponds  to  the  second 
smallest  eigenvalue  of  the  Laplacian  matrix  of 
the  graph. 

By recursively applying the spectral bisection algorithm
to the subgraphs, it is possible to partition
a graph into $2, 4, \ldots, 2^d$ subgraphs. This
heuristic was first used to partition finite element
meshes by Simon [27], who used the name Recursive
Spectral Bisection, and by Williams [3],
who named the method Eigenvalue Recursive Bisection.


6.2  The  algorithm 


Intuitively,  it  is  not  immediately  obvious  that  the 
second  eigenvector  of  the  Laplacian  matrix  of  a 
graph  is  a  good  separator  for  the  graph.  The 
following  deduction  of  the  algorithm  helps  to  un¬ 
derstand  why  this  is  nevertheless  the  case. 

We denote the graph by $G = (V, E)$, where
$V = \{v_1, v_2, \ldots, v_n\}$ is the vertex set and $E$ is the
edge set. A weight $w_e(e_{ij})$ is associated with
each edge $e_{ij} \in E$. This graph is completely
determined by its adjacency matrix $A$. This is a
symmetric $n \times n$ matrix with elements

$$a_{ii} = 0, \qquad i = 1, \ldots, n,$$

$$a_{ij} = \begin{cases} w_e(e_{ij}) & \text{if } e_{ij} \in E, \\ 0 & \text{otherwise,} \end{cases} \qquad i, j = 1, \ldots, n;\ i \neq j.$$


We search a mapping $m : \{1, 2, \ldots, n\} \to \{1, 2\}$
that minimises:

$$\sum_{e_{ij} \in E} w_e(e_{ij}) \, (1 - \delta_{m(i),m(j)}).$$

This mapping partitions the vertex set $V$ into two
subsets $V_1$ and $V_2$:

$$V_1 = \{v_i \in V \mid m(i) = 1\},$$

$$V_2 = \{v_i \in V \mid m(i) = 2\}.$$

The vertex sets $V_1$ and $V_2$ should have the same
number of elements.

Let $x$ be a vector of length $n$ whose components
are defined as follows:

$$x_i = \begin{cases} -1 & \text{if } m(i) = 1, \\ \phantom{-}1 & \text{if } m(i) = 2. \end{cases}$$

This is convenient because $1 - \delta_{m(i),m(j)} =
\frac{1}{2}(1 - x_i x_j)$. Moreover, the requirement that
$V_1$ and $V_2$ have the same number of elements is
equivalent to $\sum_{i=1}^{n} x_i = 0$.

We must therefore minimise

$$\frac{1}{2} \sum_{e_{ij} \in E} w_e(e_{ij}) \, (1 - x_i x_j),$$

subject to $x_i \in \{-1, 1\}$ and $\sum_{i=1}^{n} x_i = 0$. It will
prove advantageous to add to this expression the
term

$$\frac{1}{4} \sum_{i=1}^{n} t_i \, (x_i^2 - 1).$$

Since $x_i = \pm 1$, this term is zero and does not
change the value of the object function, but later
on, we will relax the constraint $x_i = \pm 1$ and then
this term will become important.

Using the notation $\tau = \sum_{i=1}^{n} t_i$,
$W_e = \sum_{e_{ij} \in E} w_e(e_{ij})$, $D = \mathrm{Diag}(t_i)$ and $B = D - A$,
we can write our object function as

$$\frac{1}{4} (2 W_e - \tau) + \frac{1}{4} x^T B x.$$

This function must be minimised subject to the
constraints $x_i \in \{-1, 1\}$ and $\sum_{i=1}^{n} x_i = 0$.
We choose the diagonal values $t_i$ so that each row
sum of $B$ is zero. This choice is convenient for
several reasons:

1. Since $t_i = \sum_{j:\, e_{ij} \in E} w_e(e_{ij})$ implies that
$\tau = 2 W_e$, the first term in the object function is
identically zero.

2. The matrix $B$ is positive semidefinite. If the
graph is connected, then $B$ only has a single
null vector, consisting entirely of 1's.

3. If the edge weights are all 1, then the diagonal
elements $b_{ii}$ of $B$ are equal to the degree
of the corresponding vertex $v_i$, and the
off-diagonal elements $b_{ij}$ are equal to $-1$ if the
corresponding vertices $v_i$ and $v_j$ are connected
by an edge and are equal to 0 otherwise.
This matrix is the so-called Laplacian
matrix of the graph, and we can say that in
general the matrix $B$ is a weighted Laplacian.
This is advantageous because a number
of interesting properties of the Laplacian
matrix, which can give us some guarantees
about the quality of the solution, are already
known [34, 35]. We will say more about the
Laplacian matrix later.
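
The identity that underlies this reformulation, namely that for $x_i = \pm 1$ and zero row sums of $B$ the quantity $\frac{1}{4} x^T B x$ equals the total weight of the cut edges, can be checked numerically. The sketch below uses a hypothetical weighted 4-vertex graph; the helper names `weighted_laplacian`, `quadratic_form` and `cut_weight` are assumptions made for this illustration.

```python
def weighted_laplacian(n, edges):
    """Build B = D - A as a dense list of lists from (i, j, weight) triples."""
    B = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        B[i][j] -= w
        B[j][i] -= w
        B[i][i] += w      # diagonal t_i = sum of the weights of incident edges
        B[j][j] += w
    return B


def quadratic_form(B, x):
    """Evaluate x^T B x / 4."""
    n = len(x)
    return sum(x[i] * B[i][j] * x[j] for i in range(n) for j in range(n)) / 4


def cut_weight(edges, x):
    """Total weight of the edges whose end points lie in different parts."""
    return sum(w for i, j, w in edges if x[i] != x[j])


# Hypothetical example: a weighted square with one diagonal, split into {0, 1} and {2, 3}.
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (3, 0, 2.0), (0, 2, 0.5)]
B = weighted_laplacian(4, edges)
x = [-1, -1, 1, 1]
print(quadratic_form(B, x), cut_weight(edges, x))   # both values are 4.5
```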

Using the notation $e = [1 \ 1 \ \ldots \ 1]^T$, our discrete
minimisation problem becomes:

Minimise $\frac{1}{4} x^T B x$, subject to
$x_i \in \{-1, 1\}$, and $e^T x = 0$.

This minimisation problem is still a discrete
NP-complete problem. We now relax the constraint
that each of the components of the vector $x$ must
be $\pm 1$. Instead, we impose the norm constraint
$x^T x = n$. In this way we replace our discrete
problem by the following continuous one:

Minimise $\frac{1}{4} x^T B x$, subject to $x^T x = n$,
$e^T x = 0$, and $x_i \in \mathbb{R}$ $(i = 1, 2, \ldots, n)$.

It  must  be  noticed  that  all  the  feasible  solutions  of 
the  discrete  problem  are  also  feasible  solutions  of 
the  continuous  problem.  Therefore,  the  solution 
of  the  continuous  problem  provides  a  lower  bound 
for  the  solution  of  the  discrete  one.  Contrary  to 
the  discrete  problem  however,  the  continuous  op¬ 
timisation  problem  can  be  solved  easily  thanks 
to  the  special  properties  of  the  matrix  B  : 


1.  B  is  symmetric  positive  semidefinite  ; 

2.  The  eigenvectors  of  B  can  always  be  chosen 
to  be  pairwise  orthogonal  ; 

3.  The  vector  e  is  an  eigenvector  of  B  with  ei¬ 
genvalue  zero  ; 

4.  If  the  graph  is  connected,  e  is  the  only  ei¬ 
genvector  of  B  with  eigenvalue  zero. 

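Let $0 = \lambda_1 \le \lambda_2 \le \lambda_3 \le \cdots \le \lambda_n$ be the
eigenvalues of $B$ with corresponding pairwise orthogonal
eigenvectors $e, u^{(2)}, \ldots, u^{(n)}$, where
$u^{(2)}, \ldots, u^{(n)}$ are normalised to unit length. We can write
$x$ as $x = c_1 e + c_2 u^{(2)} + \cdots + c_n u^{(n)}$. Hence,
$x^T x = n c_1^2 + \sum_{i=2}^{n} c_i^2$ and
$x^T B x = \sum_{i=2}^{n} c_i^2 \lambda_i$, and the requirement that $e^T x = 0$ is
satisfied if and only if $c_1 = 0$. Therefore, our
minimisation problem can be formulated as:

Minimise $\frac{1}{4} \sum_{i=2}^{n} c_i^2 \lambda_i$, subject to
$\sum_{i=2}^{n} c_i^2 = n$, $c_i \in \mathbb{R}$ $(i = 2, \ldots, n)$.

If $\lambda_2 < \lambda_3$, the object function is minimised for
$c_2 = \sqrt{n}$, and $c_3 = \cdots = c_n = 0$, so that
$x = \sqrt{n} \, u^{(2)}$.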

As the solution of the original discrete optimisation
problem, we take the vector $x$ with components
$x_i \in \{-1, 1\}$ that lies closest to the solution
of the continuous problem. We obtain this vector
by finding the median value among all the $x_i$'s
and mapping $x_i$ values above the median to $+1$,
and values below to $-1$. This gives a balanced
decomposition with, hopefully, a low cut-weight.
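
A compact sketch of one spectral bisection step, assuming NumPy and a dense eigensolver (adequate only for small graphs), is given below. The function name `spectral_bisection` and the test graph, two triangles joined by a single edge, are assumptions made for this illustration; they are not the example of [36].

```python
import numpy as np


def spectral_bisection(n, edges):
    """Bisect a graph at the median of the second Laplacian eigenvector."""
    B = np.zeros((n, n))
    for i, j, w in edges:
        B[i, j] -= w
        B[j, i] -= w
        B[i, i] += w
        B[j, j] += w
    eigenvalues, eigenvectors = np.linalg.eigh(B)    # ascending eigenvalues
    fiedler = eigenvectors[:, 1]                     # second smallest eigenvalue
    order = np.argsort(fiedler, kind="stable")
    x = np.ones(n, dtype=int)
    x[order[: n // 2]] = -1                          # lower half of the components
    return x, eigenvalues[1]


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
x, lambda2 = spectral_bisection(6, edges)
```

For this graph the median split separates the two triangles, so that only the joining edge is cut. Which triangle receives the label $-1$ depends on the arbitrary sign of the eigenvector.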

6.3  Example 

We  illustrate  the  spectral  bisection  algorithm 
by  applying  it  to  the  mesh  on  the  left  in 
Fig.  13  [36].  The  corresponding  interdependency 
graph  is  shown  on  the  right  side  of  the  figure. 
The  Laplacian  matrix  L  of  this  graph  is 


Fig. 13: Example mesh and corresponding interdependency
graph [36].

The second smallest eigenvalue of $L$ is $\lambda_2 =
0.6972$, and the corresponding eigenvector is

$$u^{(2)} = [\, -0.78 \ \ -0.24 \ \ 0.12 \ \ 0.39 \ \ 0.39 \ \ 0.12 \,]^T.$$

Therefore the vector $x = \sqrt{n} \, u^{(2)}$ is

$$x = [\, -1.91 \ \ -0.58 \ \ 0.29 \ \ 0.96 \ \ 0.96 \ \ 0.29 \,]^T.$$

On the basis of this we can partition the mesh
as $V_1 = \{1, 2, 3\}$ and $V_2 = \{4, 5, 6\}$, or $V_1 =
\{1, 2, 6\}$ and $V_2 = \{3, 4, 5\}$.


6.4  Laplacian  spectrum  of  regular  grids 

In  general,  dropping  the  discreteness  constraint 
in  an  optimisation  problem  and  taking  the  dis¬ 
crete  solution  that  lies  closest  to  the  solution  of 
the  relaxed  continuous  optimisation  problem  as 
the  solution  of  the  discrete  optimisation  problem 
is  a  dangerous  technique  that  does  not  guarantee 
good  solutions. 

Confidence  can  be  gained  about  the  fact  that  us¬ 
ing  the  spectrum  of  the  Laplacian  matrix  of  a 
graph  does  yield  good  partitionings,  by  study¬ 
ing  this  spectrum  for  regular  grid  graphs.  The 
following  results  for  the  path  graph  and  for  the 
five-point  grid  have  been  taken  from  [33]. 

6.4.1  The  path  graph 


1  2  3  4  5 

• - • - • - • - • 


Fig.  14: 


Path  graph  with  5  vertices. 


Let $P_n$ denote the path graph on $n$ vertices (see
Fig. 14). In the following discussion, we assume
that $n > 2$ is even. We number the vertices of
the path from 1 to $n$ in the natural order from
left to right.

The Laplacian matrix of $P_n$ is tridiagonal. Its
eigenvalues are $\lambda_k = 4 \sin^2[(k - 1)\pi/(2n)]$
$(k = 1, \ldots, n)$, thus $0 = \lambda_1 < \lambda_2 < \cdots < \lambda_n$. An
eigenvector that corresponds to the eigenvalue
$\lambda_k$ has components

$$x^{(k)}_i = \cos\left[\frac{(2i - 1)(k - 1)\pi}{2n}\right], \qquad i = 1, \ldots, n.$$

Therefore, $\lambda_2 = 4 \sin^2[\pi/(2n)]$, and
$x^{(2)}_i = \cos[(2i - 1)\pi/(2n)]$. The components of $x^{(2)}$
plotted against the vertices of $P_n$ decrease monotonically
from left to right. The first $n/2$ components
are positive and the last $n/2$ are negative.
Thus bisection based on the separator $\sigma_i = x^{(2)}_i$
splits the path graph in the middle. Intuitively,
it is clear that this is the optimal partitioning.
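
The closed-form eigenpairs of the path graph can be verified directly by applying the Laplacian to the candidate eigenvector, without any eigensolver. The following sketch does this for $n = 8$; the function names `path_laplacian_times` and `path_eigenpair` are assumptions made for this illustration.

```python
import math


def path_laplacian_times(x):
    """Multiply the Laplacian of the path graph P_n with a vector x."""
    n = len(x)
    y = []
    for i in range(n):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
        y.append(len(neighbours) * x[i] - sum(x[j] for j in neighbours))
    return y


def path_eigenpair(n, k):
    """Closed-form k-th eigenvalue and eigenvector (k = 1..n) of P_n."""
    lam = 4 * math.sin((k - 1) * math.pi / (2 * n)) ** 2
    vec = [math.cos((2 * i - 1) * (k - 1) * math.pi / (2 * n))
           for i in range(1, n + 1)]
    return lam, vec


n = 8
lam2, x2 = path_eigenpair(n, 2)
# Largest component of L x - lambda x; it vanishes up to rounding errors.
residual = max(abs(a - lam2 * b) for a, b in zip(path_laplacian_times(x2), x2))
```

The components in `x2` decrease from left to right and change sign between vertices $n/2$ and $n/2 + 1$, which is where the median cut falls.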


6.4.2  The  five-point  grid 


We consider the $m \times n$ five-point grid, and
without loss of generality take $m \le n$. We assume
that $n > 2$ is even.

The spectrum of the five-point grid can be derived
from the spectrum of the path graph. The
eigenvalues are

$$\mu_{k,l} = 4 \left[ \sin^2\left(\frac{(k - 1)\pi}{2n}\right) + \sin^2\left(\frac{(l - 1)\pi}{2m}\right) \right], \qquad k = 1, \ldots, n;\ l = 1, \ldots, m.$$

An eigenvector that corresponds to the eigenvalue
$\mu_{k,l}$ has components

$$y^{(k,l)}_{i,j} = \cos\left[\frac{(2i - 1)(k - 1)\pi}{2n}\right] \cos\left[\frac{(2j - 1)(l - 1)\pi}{2m}\right], \qquad i = 1, \ldots, n;\ j = 1, \ldots, m.$$

The smallest eigenvalue $\mu_{1,1}$ is zero. If $m < n$,
the second smallest eigenvalue is $\mu_{2,1} =
4 \sin^2[\pi/(2n)]$ and the corresponding eigenvector
has components $y^{(2,1)}_{i,j} = \cos[(2i - 1)\pi/(2n)]$.
The components of $y^{(2,1)}$ are constant along each
column of $m$ vertices, and the components decrease
from left to right across a row. Columns
numbered from 1 to $n/2$ have positive components,
and the rest of the columns have negative
components. The components of this eigenvector
of the $m \times n$ five-point grid are shown in Fig. 15.


If $m = n$, $\mu_{1,2} = \mu_{2,1}$ and the second smallest
eigenvalue of the Laplacian matrix has multiplicity
two. The linearly independent eigenvectors
$y^{(1,2)}$ and $y^{(2,1)}$ span the two-dimensional eigenspace
that corresponds to this eigenvalue. These vectors
correspond respectively to bisecting the graph
horizontally and vertically in the middle, which
are indeed optimal solutions. Notice that in practice,
an eigensolver will in general yield vectors
that are linear combinations of these two optimal
solutions.
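
The formulas for the five-point grid can be checked in the same direct way as for the path graph. The sketch below stores a grid vector as a dictionary keyed by $(i, j)$; the function names `grid_eigenpair` and `grid_laplacian_times` and the sizes $n = 6$, $m = 4$ are assumptions made for this illustration.

```python
import math


def grid_eigenpair(n, m, k, l):
    """Closed-form eigenvalue mu_{k,l} and eigenvector of the m x n five-point grid."""
    mu = 4 * (math.sin((k - 1) * math.pi / (2 * n)) ** 2
              + math.sin((l - 1) * math.pi / (2 * m)) ** 2)
    y = {(i, j): math.cos((2 * i - 1) * (k - 1) * math.pi / (2 * n))
                 * math.cos((2 * j - 1) * (l - 1) * math.pi / (2 * m))
         for i in range(1, n + 1) for j in range(1, m + 1)}
    return mu, y


def grid_laplacian_times(y):
    """Apply the five-point grid Laplacian to a vector stored as {(i, j): value}."""
    result = {}
    for (i, j), value in y.items():
        nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
        nbrs = [p for p in nbrs if p in y]          # drop neighbours outside the grid
        result[i, j] = len(nbrs) * value - sum(y[p] for p in nbrs)
    return result


n, m = 6, 4
mu, y = grid_eigenpair(n, m, 2, 1)
Ly = grid_laplacian_times(y)
residual = max(abs(Ly[p] - mu * y[p]) for p in y)   # vanishes up to rounding errors

# For a square grid the two candidate eigenvalues coincide (multiplicity two).
mu12, _ = grid_eigenpair(4, 4, 1, 2)
mu21, _ = grid_eigenpair(4, 4, 2, 1)
```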


6.5  Connectivity  of  the  subgraphs 

If the original graph is connected, it can be guaranteed
that at least one of the resulting subgraphs
will be connected too. This follows from the
following theorem by Fiedler [35].

Fig. 15: The second Laplacian eigenvector of
the five-point grid.

Theorem. Let $G$ be a connected graph and let
$x$ be an eigenvector corresponding to the second
smallest eigenvalue of the Laplacian matrix of the
graph. For a real number $r \ge 0$, define $V_1(r) =
\{v \in V \mid x_v \ge -r\}$. Then the subgraph induced
by $V_1(r)$ is connected. Similarly, the subgraph
induced by the set $V_2(r) = \{v \in V \mid x_v \le r\}$ is
also connected. If $r = 0$, it is necessary to include
the vertices with zero components in both sets $V_1$
and $V_2$ for the theorem to hold.

In practice, most often both subgraphs will be
connected. It must be noticed that in general it
is not possible to bisect a connected graph into two
connected and equally sized subgraphs anyway.
A simple example of such a graph is shown in
Fig. 16.

Fig. 16: A connected graph that cannot be bisected
in connected and equally sized subgraphs.


6.6  Examples  and  experiments 

In Table 1, some results obtained by Venkatakrishnan
et al. [37] are presented. A mesh-vertex
upwind finite volume scheme was used on
a 64-processor iPSC/860 machine to solve the
Euler equations around a multi-flap airfoil on
Barth5, a two-dimensional triangular unstructured
fluid dynamics mesh from NASA Ames
with 15606 vertices, 45878 edges, 30269 faces
and 949 boundary edges. The finite volume mesh
was partitioned using the Spectral Bisection and
the Coordinate Bisection technique on the original
mesh-graph and using the Spectral Bisection
method on the dependency graph of the
mesh. For this example, the use of the Spectral
Bisection technique leads to a performance that
is 30% higher than if the Coordinate Bisection
method is used.

Figure 17 shows the partitioning into eight subgrids
of the mesh in Fig. 5 that one obtains with
the Recursive Spectral Bisection algorithm. This
method cuts only 258 edges (97 fewer than the
greedy heuristic of Farhat), and yields eight connected
subgrids. Notice also that the subgrids
have good aspect ratios.

For  the  RYMAMO  grid  as  well,  the  Recursive 
Spectral  Bisection  algorithm  gives  very  good  res¬ 
ults.  Figure  18  shows  the  partitioning.  It  cuts 
280  edges  (greedy  heuristic:  365)  and  yields 
10  connected  parts  (greedy  heuristic:  22). 


6.7  Generalisations  of  the  spectral  bisec¬ 
tion  algorithm 

Hendrickson and Leland extended the spectral
bisection method to quadri- and octasection of
graphs [38]. Moreover, they also generalised it
to the case that not only the edges but also the
vertices are weighted [39]. Hendrickson and Leland
show that the partitions that are obtained in
this way are better than the ones obtained by
recursively applying the bisection algorithm if the
hypercube hop (or Manhattan) metric is used as
the cost measure. Empirical study [40] has shown
that this is an appropriate measure for modelling
the performance of hypercube architecture machines
since minimising this metric corresponds
to minimising congestion within the communication
network. The hop metric is also appropriate
for two- and three-dimensional mesh architectures.

Van Driessche and Roose [5, 41] developed a
spectral bisection algorithm for the constrained
graph bisectioning problem, a generalisation of
the graph bisectioning problem in which the assignment
of part of the vertices is imposed a priori.
Although this spectral algorithm was originally
developed for dynamic load balancing, it is
also useful for static load balancing. By solving a
sequence of constrained graph bisectioning problems,
it is possible to take the mapping problem
already into account during the mesh partitioning
[42]. In this way, it is possible to ensure that
the subgrids are assigned to processors that are
close to each other in the communication topology.
Moreover, the number of neighbouring subgrids
per subgrid is smaller than if the standard
Recursive Spectral Bisection is used.

Table 1: Comparison between Spectral Bisection and Coordinate
Bisection.

Method                             Spectral    Coordinate  Spectral
                                   Bisection   Bisection   (Depend. Graph)
Total time (sec)                   0.31        0.41        0.31
Performance (Mflops)               187.5       143         188
Communication time (sec)           0.084       0.173       0.082
Average number of neighbours       4.7         6.7         4.5
Number of intern. bound. vertices  1819        2631        1791
Maximum number of neighbours       12          14          14
Maximum number of vertices         101         120         109

Fig. 17: Partitioning into 8 subgrids of the finite element grid in
Fig. 5 with the Recursive Spectral Bisection algorithm.

Fig. 18: Partitioning into 8 subgrids of the RYMAMO grid with
the Recursive Spectral Bisection algorithm.

Figure 19 shows the partitioning of the mesh in
Fig. 5 that is yielded by this Recursive Constrained
Spectral Bisection algorithm for a hypercube
topology. This partition cuts slightly
more edges than the partition one obtains with
the standard Recursive Spectral Bisection algorithm
(viz. 282 versus 258) but the maximal
number of adjacent subgrids per subgrid is smaller
(viz. 4 versus 5). Moreover, a small readjustment
of the boundaries is sufficient to ensure that
each subgrid has no more than 3 neighbouring
subgrids. Fig. 20 illustrates that the interdependency
topology between the subgrids more closely
matches the hypercube communication topology
of the parallel computer than if the standard
Recursive Spectral Bisection algorithm is used.

Fig. 20: (a) Interdependency topology of the
subgrids in Fig. 17 (standard Bisection
Algorithm). (b) Interdependency topology
of the subgrids in Fig. 19 (Constrained
Bisection Algorithm).


6.8  Calculation  of  the  eigenvectors 

For the spectral bisection technique, one has to
calculate the eigenvector that corresponds to the
second smallest eigenvalue of a large, sparse (and
symmetric) matrix. The Lanczos algorithm [43]
is particularly well-suited for this problem because
it only uses the matrix through matrix-vector
products, so that the sparsity of the Laplacian
matrix can be exploited, and because it typically
converges to the extreme eigenvalues and
eigenvectors of a matrix in $O(\sqrt{n})$ steps, each of
which has a complexity of order $n$.
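
To make the role of the matrix-vector products concrete, the following NumPy sketch runs a plain Lanczos iteration in the subspace orthogonal to the null vector $e$. It is a simplified illustration and not the implementation of [43]: the names `lanczos_fiedler` and `path_matvec`, the random start vector, the breakdown tolerance and the use of full reorthogonalisation (which production codes avoid because of its cost) are all assumptions made here.

```python
import numpy as np


def lanczos_fiedler(matvec, n, steps, seed=0):
    """Approximate the Fiedler pair using only products matvec(v) = B v."""
    rng = np.random.default_rng(seed)
    e = np.ones(n) / np.sqrt(n)
    v = rng.standard_normal(n)
    v -= (e @ v) * e                     # stay orthogonal to the null vector e
    v /= np.linalg.norm(v)
    basis, alphas, betas = [v], [], []
    for _ in range(steps):
        w = matvec(basis[-1])
        alphas.append(basis[-1] @ w)
        # Full reorthogonalisation against e and all previous Lanczos vectors.
        for q in [e] + basis:
            w -= (q @ w) * q
        beta = np.linalg.norm(w)
        if beta < 1e-10:                 # invariant subspace found
            break
        betas.append(beta)
        basis.append(w / beta)
    k = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:k - 1], 1) + np.diag(betas[:k - 1], -1)
    ritz_values, ritz_vectors = np.linalg.eigh(T)    # small tridiagonal problem
    Q = np.column_stack(basis[:k])
    return ritz_values[0], Q @ ritz_vectors[:, 0]


def path_matvec(v):
    """Sparse product with the Laplacian of a path graph, never forming the matrix."""
    w = 2.0 * v
    w[0] -= v[0]
    w[-1] -= v[-1]
    w[1:] -= v[:-1]
    w[:-1] -= v[1:]
    return w


n = 40
value, vector = lanczos_fiedler(path_matvec, n, steps=n - 1)
```

In this small check the iteration is run until the Krylov subspace is exhausted, so the Ritz value agrees with $4 \sin^2[\pi/(2n)]$ up to rounding. For a large graph one would stop after far fewer steps and accept an approximation.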

For  large  graphs,  the  calculation  time  and  espe¬ 
cially  the  memory  requirements  of  the  Lanczos 
algorithm  are  often  unacceptable.  Barnard  and 
Simon  [44,  45]  introduced  a  multilevel  algorithm 
that  calculates  the  Fiedler  vector  considerably 
faster  and  with  less  memory  than  the  Lanczos 
algorithm.  This  algorithm  first  constructs  a  se¬ 
quence  of  graphs,  in  such  a  way  that  the  initial 
graph  is  the  first  graph  in  the  sequence,  and  that 
the  other  graphs  are  the  contractions  of  the  pre¬ 
vious  graph  in  the  sequence. 

A  graph  is  contracted  as  follows.  First,  a  max¬ 
imal  number  of  non-adjacent  vertices  are  selec¬ 
ted  that  will  form  the  vertex  set  of  the  contrac¬ 
ted  graph.  Next,  the  edge  set  is  constructed  by 
growing  domains  in  the  original  graph  round  the 
vertices  of  the  contracted  graph,  and  by  adding 
an  edge  to  the  contracted  graph  whenever  two 
domains  intersect. 
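
The contraction step might be sketched as follows. Several simplifications are assumed and should not be read as the procedure of [44, 45]: the vertex set is chosen as a greedy maximal independent set, each domain is grown by only one layer (the selected vertex plus its neighbours), and two domains are taken to intersect when they share a vertex or are joined by an edge of the original graph. The function name `contract` is hypothetical.

```python
def contract(adjacency):
    """Contract a graph given as {vertex: set of neighbours}."""
    # Greedy maximal independent set: no two selected vertices are adjacent.
    selected = []
    blocked = set()
    for v in sorted(adjacency):
        if v not in blocked:
            selected.append(v)
            blocked.add(v)
            blocked.update(adjacency[v])
    # Domain of a selected vertex: the vertex itself plus its neighbours.
    domain = {s: {s} | adjacency[s] for s in selected}
    coarse = {s: set() for s in selected}
    for a in selected:
        for b in selected:
            if a < b:
                # Domains touch if they share a vertex or an original edge joins them.
                touching = any(u in domain[b] or adjacency[u] & domain[b]
                               for u in domain[a])
                if touching:
                    coarse[a].add(b)
                    coarse[b].add(a)
    return coarse


# A path with 7 vertices contracts to a path with 4 vertices (0, 2, 4 and 6).
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 7} for i in range(7)}
coarse = contract(path)
```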

Once  the  sequence  of  graph  contractions  has  been 
constructed,  the  Fiedler  vector  of  the  smallest 
graph  is  calculated  and  is  prolongated  to  the  pre¬ 
vious  graph  in  the  sequence.  This  prolongation 
is  already  a  good  approximation  for  the  Fiedler 
vector  of  this  graph  and  can  therefore  be  rapidly 
improved  with  Rayleigh  quotient  iteration.  This 
procedure  is  recursively  applied  until  the  Fiedler 
vector  of  the  first  graph  in  the  sequence,  i.e.  the 
Fiedler  vector  of  the  original  graph,  has  been  cal¬ 
culated. 

Using  this  technique,  Barnard  and  Simon  claim 
to  obtain  partitions  with  comparable  quality  in 
up  to  20  times  less  time  than  with  the  Lanczos 
algorithm. 

Fig. 19: Partitioning into 8 subgrids of the finite element grid in
Fig. 5 with the Recursive Constrained Spectral Bisection
algorithm.

Van Driessche and Roose [46] have presented an
alternative graph contraction algorithm that uses
the same procedure to select the vertex set but
that assigns weights to the edges of the contracted
graph. This algorithm yields very good eigenvector
approximations. They are also able to
give a formal analysis that helps to explain why
and when the algorithm gives such good results.

Hendrickson and Leland [47] use a completely
different graph contraction algorithm, in which
they contract some edges of the graph. They
first search for a maximal matching in the graph.
This is a maximal set of edges, no two of which
are incident on the same vertex. The edges in this
set are then contracted as follows. The vertices
joined by an edge that must be contracted are
merged into one 'super vertex', and the new super
vertex is given edges to the union of the neighbours
of the merged vertices. The weight of the
super vertex is set equal to the sum of the weights
of its constituent vertices. Edge weights are left
unchanged unless both merged vertices are adjacent
to the same neighbour. In this case the
new edge that represents the two original edges
is given a weight equal to the sum of the weights
of the two edges it replaces.
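
The matching-based contraction can be sketched as below. The greedy choice of the matching (edges are scanned in sorted order), the dictionary-based graph representation and the function name `contract_matching` are assumptions made for this illustration; super vertices are represented by tuples of the merged vertex labels.

```python
def contract_matching(vertex_weight, edge_weight):
    """Contract a greedy maximal matching.

    vertex_weight: {vertex: weight}; edge_weight: {(u, v): weight} with u < v.
    Returns the vertex and edge weights of the contracted graph.
    """
    # Greedy maximal matching: no two chosen edges share a vertex.
    matched = {}
    for u, v in sorted(edge_weight):
        if u not in matched and v not in matched:
            matched[u] = matched[v] = (u, v)
    # Every vertex maps to a super vertex (unmatched vertices map to themselves).
    super_of = {v: matched.get(v, (v,)) for v in vertex_weight}
    coarse_vertices = {}
    for v, w in vertex_weight.items():
        coarse_vertices[super_of[v]] = coarse_vertices.get(super_of[v], 0) + w
    coarse_edges = {}
    for (u, v), w in edge_weight.items():
        a, b = super_of[u], super_of[v]
        if a != b:                       # the contracted edge itself disappears
            key = (min(a, b), max(a, b))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return coarse_vertices, coarse_edges


# Triangle 0-1-2 with a pendant vertex 3; all vertex and edge weights are 1.
vertices = {0: 1, 1: 1, 2: 1, 3: 1}
edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (2, 3): 1}
coarse_vertices, coarse_edges = contract_matching(vertices, edges)
```

In this example vertices 0 and 1 are both adjacent to vertex 2, so the single edge between the two super vertices receives weight 2, as described in the text.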

7 COMBINATIONS OF PARTITIONING ALGORITHMS

The  most  powerful  partitioning  heuristics  use  a 
combination  of  the  techniques  discussed  in  the 
previous  sections.  We  will  discuss  three  ex¬ 
amples,  viz.  the  combination  of  recursive  spec¬ 
tral  and  inertial  bisection  with  the  Kernighan- 
Lin  heuristic,  the  two-step  approach  of  Vander- 
straeten  and  Keunings  to  optimise  complicated 
cost  functions,  and  the  multilevel  partitioning  al¬ 
gorithm  of  Bui  and  Jones  and  Hendrickson  and 
Leland. 

7.1  Improving  a  partition  with  the 
Kernighan-Lin  heuristic 

The  Kernighan-Lin  heuristic  is  an  iterative  al¬ 
gorithm  that  improves  an  initial  partition  by 
repeatedly  swapping  elements  among  the  parti¬ 
tions.  Starting  with  a  random  assignment  of  grid 
points  to  processors  usually  gives  disappointing 
results  because  of  the  inherently  greedy  and  local 
nature  of  the  algorithm.  On  the  other  hand, 
the  Recursive  Spectral  Bisection  algorithm  often 
yields  partitions  that  are  globally  good  but  that 
perform poorly in the fine details. It is therefore advantageous to
calculate an initial partitioning with the Recursive Spectral
Bisection algorithm, and improve this with the Kernighan-Lin
heuristic. As an example, in Fig. 17 the boundaries between the
subdomains are not very smooth, but this is considerably improved,
and the number of cut edges reduced from 258 to 226, with the
Kernighan-Lin heuristic. Figure 21 shows the partitioning into eight
subgrids of the mesh in Fig. 5 that results from applying the
Kernighan-Lin heuristic to the partitioning in Fig. 17. Notice that
the boundaries between the subgrids have become much smoother.
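To make the refinement step concrete, the following is a minimal sketch
of a single Kernighan-Lin pass on a bisection of an unweighted graph.
The function names and the adjacency-dictionary representation are
assumptions for this example, and the quadratic search for the best
pair is chosen for readability, not speed. The pass tentatively swaps
pairs of vertices in order of decreasing gain, locks them, and finally
keeps only the prefix of swaps with the largest cumulative gain.

```python
def cut_size(adj, side):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u in adj for v in adj[u] if u < v and side[u] != side[v])


def kernighan_lin_pass(adj, side):
    """One Kernighan-Lin pass on a bisection.

    adj:  dict vertex -> set of neighbours (undirected, unweighted)
    side: dict vertex -> 0 or 1 (the initial bisection)
    Returns a new assignment whose cut is never larger than the initial one.
    """
    work = dict(side)
    # D[v] = external minus internal degree: the gain of moving v alone.
    D = {v: sum(1 if work[w] != work[v] else -1 for w in adj[v]) for v in adj}
    locked = set()
    swaps, gains = [], []
    n_pairs = min(sum(1 for v in work if work[v] == 0),
                  sum(1 for v in work if work[v] == 1))
    for _ in range(n_pairs):
        best = None
        for a in adj:
            if a in locked or work[a] != 0:
                continue
            for b in adj:
                if b in locked or work[b] != 1:
                    continue
                gain = D[a] + D[b] - 2 * (1 if b in adj[a] else 0)
                if best is None or gain > best[0]:
                    best = (gain, a, b)
        gain, a, b = best
        # Tentatively swap a and b, lock them, and update the D values.
        work[a], work[b] = 1, 0
        locked.update((a, b))
        swaps.append((a, b))
        gains.append(gain)
        for v in adj:
            if v not in locked:
                D[v] = sum(1 if work[w] != work[v] else -1 for w in adj[v])
    # Keep only the prefix of swaps with the largest cumulative gain.
    best_k, best_total, total = 0, 0, 0
    for k, g in enumerate(gains, start=1):
        total += g
        if total > best_total:
            best_k, best_total = k, total
    result = dict(side)
    for a, b in swaps[:best_k]:
        result[a], result[b] = 1, 0
    return result
```

For two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3, a
poor initial bisection that puts vertices 0, 1, 3 in one part cuts five
edges; one pass swaps vertices 3 and 2 and leaves a single cut edge.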

The  Recursive  Inertial  Bisection  algorithm  also 
benefits  greatly  from  a  Kernighan-Lin  post¬ 
processing  step.  The  quality  of  the  resulting  par¬ 
titions  is  often  comparable  to  what  one  obtains 
with  Recursive  Spectral  Bisection  (but  worse 
than  what  the  combination  of  spectral  bisection 
with  Kernighan-Lin  gives),  while  the  calculation 
time  is  considerably  lower. 

Figure 22 shows the partitioning of the RYMAMO mesh after applying
the Kernighan-Lin heuristic to the result of the inertial bisection
algorithm. This partition has 13 connected parts (20 without the
Kernighan-Lin heuristic) and cuts 281 edges (485 without the
Kernighan-Lin heuristic), thus only 1 edge more than the partition
obtained with the Recursive Spectral Bisection algorithm.

7.2 The multilevel-Kernighan-Lin algorithm of Hendrickson and Leland

7.2.1  Description 

The good performance of the Kernighan-Lin heuristic at locally
improving a partition that is already globally good is also put to
use in the multilevel algorithms of Bui and Jones [48] and
Hendrickson and Leland [47].

Fig. 21: Partitioning into 8 subgrids of the finite element grid in
Fig. 5 with the Recursive Spectral Bisection algorithm and the
Kernighan-Lin heuristic.

Fig. 22: Partitioning into 8 subgrids of the RYMAMO grid with the
Recursive Inertial Bisection algorithm and the Kernighan-Lin
heuristic.

The idea is to create a sequence of increasingly smaller graphs that
in some sense approximate the original graph. The smallest graph is
partitioned (with, say, Recursive Spectral Bisection) and this
partition is projected back through the intermediate levels. Every
few levels of projection, the Kernighan-Lin heuristic is used to
refine the partition.

For  the  construction  of  the  smaller  graphs, 
Hendrickson  and  Leland  use  the  same  contraction 
algorithm,  described  in  Section  6.8,  that  they  use 
in  the  multilevel  algorithm  for  the  calculation  of 
the  Fiedler  vector  of  a  graph.  Bui  and  Jones  have 
proposed  a  similar  algorithm,  but  in  contrast  to 
the  algorithm  of  Hendrickson  and  Leland,  it  does 
not  use  vertex  and  edge  weights.  In  practice,  the 
difference  between  the  two  methods  is  small  with 
neither  method  being  consistently  superior  [49]. 

7.2.2  Examples 

Figure  23  shows  the  partitioning  into  eight  sub¬ 
grids  of  the  mesh  in  Fig.  5,  which  one  obtains 
with  the  multilevel  algorithm  of  Hendrickson  and 
Leland.  This  partition  cuts  202  edges,  fewer  than 
the  partition  that  we  obtained  with  the  Recurs¬ 
ive  Spectral  Bisection  algorithm  even  if  it  is  im¬ 
proved  with  the  Kernighan-Lin  heuristic.  Notice 
that  the  subgrids  have  very  good  aspect  ratios. 
For the RYMAMO grid as well, the multilevel algorithm finds a
partition that only cuts a small number of edges. Figure 24 shows the
partition into eight subgrids. It cuts 246 edges (greedy heuristic:
365, Recursive Spectral Bisection algorithm: 280).

Table  2,  which  is  taken  from  [30],  gives  res¬ 
ults  about  the  partitioning  of  Barth5,  a  two- 
dimensional  fluid  dynamics  mesh,  for  which  we 
presented  results  in  Section  6.6  that  demonstrate 
the  influence  of  the  partitioning  on  the  calcula¬ 
tion  time  of  an  Euler  solver.  The  dual  graph, 
which  has  15606  vertices  and  45878  edges,  was 
partitioned  into  2,  4,  8,  16,  32  and  64  parts,  both 
with  the  inertial  and  the  spectral  algorithm,  alone 
and  in  combination  with  the  Kernighan-Lin  heur¬ 
istic.  The  graph  was  also  partitioned  with  the 
multilevel  algorithm. 

A comparison of the number of cut edges on the one hand and the
calculation times on the other demonstrates that the combination of
Recursive Inertial Bisection with the Kernighan-Lin heuristic yields
partitions of comparable quality to those of the Recursive Spectral
Bisection algorithm at a fraction of the cost. However, the most
cost-effective method turns out to be the multilevel algorithm: it
finds partitions that are comparable to or even better than what one
obtains with a combination of the Recursive Spectral Bisection
algorithm and the Kernighan-Lin heuristic, while the calculation time
(and the memory usage) is considerably smaller.

Table 2: Partitioning of Barth5 with the multilevel algorithm of
Hendrickson and Leland, and with the Recursive Inertial and Spectral
Bisection algorithms, both with and without Kernighan-Lin refinement
[30].


7.3 Improving a partition with a stochastic optimisation algorithm

Vanderstraeten  and  Keunings  have  tried  to  im¬ 
prove  an  initial  partition  with  stochastic  op¬ 
timisation  algorithms  [50].  They  have  tested 
three  algorithms,  viz.  simulated  annealing  (see 
Section  3.1  and  the  references  therein),  tabu 
search  [51,  52],  and  stochastic  evolution  [53]. 
These  algorithms  are  expensive,  but  they  can 
start  from  a  good  initial  solution.  Moreover,  only 
a  relatively  small  search  space  must  be  explored 
because  only  subdomain  interfaces  are  readjus¬ 
ted. 

This  two-step  approach,  first  generating  an  ini¬ 
tial  mesh  decomposition  with  a  suboptimal  but 
fast  partitioning  algorithm,  and  next  optim¬ 
ising  this  partition  with  a  stochastic  optimiza¬ 
tion  algorithm,  is  able  to  generate  partitions  with 
smooth  boundaries  and  a  small  number  of  cut 
edges.  Moreover,  thanks  to  the  general  applicab¬ 
ility  of  the  stochastic  algorithms,  it  is  also  pos¬ 
sible  to  optimise  much  more  complicated  cost 
functions  that  do  not  just  take  the  number  of  cut 
edges  into  account  [54]. 
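The two-step idea can be sketched with a toy simulated annealing
refinement. This is not the algorithm of Vanderstraeten and Keunings:
the cost function (cut edges plus a penalty on the spread of subdomain
sizes), the linear cooling schedule, and all names and parameters are
assumptions chosen for illustration. What the sketch shares with the
description above is that it starts from a given partition and only
moves vertices that lie on a subdomain interface.

```python
import math
import random


def cost(adj, part, n_parts, balance_weight=1.0):
    """Cut edges plus a penalty on the spread of subdomain sizes."""
    cut = sum(1 for u in adj for v in adj[u] if u < v and part[u] != part[v])
    sizes = [0] * n_parts
    for p in part.values():
        sizes[p] += 1
    return cut + balance_weight * (max(sizes) - min(sizes))


def anneal_interfaces(adj, part, n_parts, steps=2000, t0=1.0, seed=0):
    """Refine a partition by moving interface vertices only."""
    rng = random.Random(seed)
    current = dict(part)
    current_cost = cost(adj, current, n_parts)
    best, best_cost = dict(current), current_cost
    for step in range(steps):
        temperature = t0 * (1.0 - step / steps) + 1e-9
        # Interface vertices have at least one neighbour in another part.
        interface = [v for v in adj
                     if any(current[w] != current[v] for w in adj[v])]
        if not interface:
            break
        v = rng.choice(interface)
        old = current[v]
        # Move v into the part of one of its foreign neighbours.
        current[v] = rng.choice([current[w] for w in adj[v]
                                 if current[w] != old])
        new_cost = cost(adj, current, n_parts)
        delta = new_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_cost = new_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
        else:
            current[v] = old  # reject the move
    return best, best_cost
```

Because the best partition seen so far is stored separately, the
returned cost can never exceed the cost of the initial partition.
Replacing `cost` by another function is all that is needed to optimise
a different criterion.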


Fig. 23: Partitioning into 8 subgrids of the finite element grid in
Fig. 5 with the multilevel algorithm of Hendrickson and Leland.

Fig. 24: Partitioning into 8 subgrids of the RYMAMO grid with the
multilevel algorithm of Hendrickson and Leland.

8 SOFTWARE TOOLS FOR PARTITIONING

8.1  Introduction 

Graph partitioning, and hence mesh partitioning for parallel
computing, is by now a fairly well-understood problem, and several
efficient software tools exist for this purpose. We will discuss two
tools that contain state-of-the-art partitioning algorithms and that
are well supported and well documented, viz. Chaco and TOP/DOMDEC.
For most CFD applications these tools will be sufficient to obtain
good mesh partitions, so that it is not necessary for users to
develop their own code.

Some open issues remain, however. Firstly, these tools are meant to
be used as pre-processing tools on a sequential computer. They are
therefore not suitable for parallel applications that also want to
calculate the partitions in parallel, e.g. applications that need
quasi-static load balancing. Secondly, it has been argued that the
standard optimisation criterion (minimal number of cut edges) is not
suitable for many applications [8, 30, 54]. Although the tools
discussed here provide a limited number of optimisation criteria
besides this standard criterion, it is still not clear whether any of
these can accurately model the execution of an application on a
parallel computer, where factors such as communication link
contention and cache usage, which are difficult to model, often
greatly influence performance.

8.2  Chaco 

8.2.1  Introduction 

Chaco  is  a  software  package  designed  to  partition 
graphs.  It  was  written  by  Bruce  Hendrickson  and 
Robert  Leland  of  Sandia  National  Laboratories 
(Albuquerque,  New  Mexico,  USA).  Version  1.0 
was  released  in  1993.  The  much  improved  Ver¬ 
sion  2.0  will  be  released  in  May  1995.  It  is  this 
version  that  will  be  discussed  here. 

8.2.2  Description 

Chaco  implements  four  classes  of  global  parti¬ 
tioning  algorithms: 

Simple:  three  very  simple  partitioning  schemes, 
in  which  vertices  are  assigned  to  processes  ran¬ 
domly  or  according  to  their  numbering  in  the  ori¬ 
ginal  graph. 

Inertial: recursive inertial bi-, quadri- or octasection (see
Section 5).


Spectral:  recursive  spectral  bi-,  quadri-  or 
octasection  (see  Section  6).  The  user  can  specify 
whether  the  eigenvectors  of  the  Laplacian  matrix 
must  be  calculated  with  a  Lanczos  algorithm  or 
with  a  multilevel  algorithm. 

Multilevel:  the  multilevel  algorithm  described 
in  Section  7.2. 

The  output  of  any  of  these  global  methods  can  be 
fed  into  a  Kernighan-Lin  algorithm  which  locally 
refines  the  partition. 

Chaco only offers graph partitioning and uses a non-graphics
interface, so there are no visualisation tools, or tools to create
meshes. However, several people have written MATLAB interfaces for
Chaco. In particular, John Gilbert at Xerox PARC has written and
agreed to maintain visualisation software that is freely available.

Chaco  is  normally  used  interactively  with  the 
program  prompting  for  the  name  of  input  and 
output  files,  for  the  number  of  sets  the  graph 
should  be  partitioned  into,  and  also  for  data 
about  the  requested  partitioning  heuristic.  The 
behaviour  of  Chaco  is  determined  by  a  large 
number  of  parameters  and  tolerances,  for  which 
the  program  chooses  suitable  default  values. 
However,  the  user  can  create  a  file  with  alternat¬ 
ive  values,  and  is  thus  able  to  experiment  with 
Chaco.  Although  normally  used  interactively, 
Chaco  also  provides  an  interface  routine  that  al¬ 
lows  it  to  be  called  from  user  code. 

8.2.3  Availability 

Chaco  is  available  under  license  from  Sandia  Na¬ 
tional  Laboratories.  It  is  distributed  along  with 
technical  documentation  and  some  sample  input 
files  via  e-mail.  To  obtain  a  copy,  contact  the 
authors 

Bruce  Hendrickson 
Dept.  1422,  Mail  Stop  1110 
Sandia  National  Laboratories 
Albuquerque,  NM  87185,  U.S.A. 

E-mail:  bah@cs.sandia.gov 

and 

Robert  Leland 
Dept.  1424,  Mail  Stop  1110 
Sandia  National  Laboratories 
Albuquerque,  NM  87185,  U.S.A. 

E-mail:  leland@cs.sandia.gov 

At the time of writing this text, licensing conditions for academics
were not completely fixed. For corporations, Chaco will be licensed
on a case-by-case basis.

Chaco is written in Kernighan and Ritchie style, but ANSI-compliant,
C and, except for the mathematics library, uses no external
libraries. It should therefore compile and run correctly under any
UNIX system with any ANSI-C standard compiler, and can usually be
compiled without too many problems with non-standard compilers as
well.

8.3  TOP/DOMDEC 

8.3.1  Introduction 

TOP/DOMDEC is, in the words of the manual [55], a Totally Object
oriented Package for visualisation, DOMain DEComposition, and
parallel processing on finite element meshes. It was developed by
PGSoft and by the research group of C. Farhat at the University of
Colorado at Boulder.

As  a  partitioning  tool,  TOP/DOMDEC  of¬ 
fers  several  state-of-the-art  mesh  partitioning 
algorithms,  whose  partitions  can  subsequently 
be  smoothed  and  optimised  using  one  of 
several  non-deterministic  optimisation  schemes. 
TOP/DOMDEC  also  provides  real-time  means 
for  assessing  a  priori  the  quality  of  a  mesh  par¬ 
tition  and  discriminating  between  different  par¬ 
titioning  algorithms.  The  user  interface  includes 
high  speed  three-dimensional  graphics,  an  inter¬ 
processor  communication  simulator  with  a  built- 
in  cost  model  for  some  real-world  parallel  com¬ 
puters  and  for  a  generic  message-passing  parallel 
computer,  and  an  output  function  that  automat¬ 
ically  generates  parallel  I/O  data  structures. 

8.3.2  Description 

Here,  we  will  only  concisely  describe  TOP/ 
DOMDEC  as  a  mesh  partitioning  tool.  A  more 
thorough  discussion,  which  also  discusses  the 
other  capabilities  of  TOP/DOMDEC  can  be 
found  in  the  manual  [55],  or  in  Chapter  9  of  [26]. 
Just  like  in  Chaco,  the  idea  in  TOP/DOMDEC 
is  that  you  first  partition  the  mesh  with  a  global 
partitioning  algorithm,  and  that  this  initial  par¬ 
tition  is  subsequently  refined  with  a  local  optim¬ 
isation  algorithm.  TOP/DOMDEC  provides  the 
following  global  partitioning  algorithms: 

Greedy:  the  greedy  heuristic  of  Farhat  (see  Sec¬ 
tion  4). 

RCM and Recursive RCM: the Reverse Cuthill-McKee ordering scheme
(RCM), and the Recursive RCM algorithm (see Section 3.2.2).

Principal Inertia (PI) and Recursive PI: the Principal Inertia
algorithm projects all the mesh points onto the principal inertia
direction of the mesh and sorts the mesh points according to this
projection into the requested number of subdomains. The Recursive PI
algorithm uses the above procedure recursively to bisect the mesh and
submeshes and is therefore identical to the Recursive Inertial
Bisection heuristic described in Section 5.4.
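The (non-recursive) Principal Inertia idea can be sketched for
two-dimensional point sets as follows. This is an illustration only,
not TOP/DOMDEC code: the function name is hypothetical, all points are
given unit mass, and the principal direction is taken as the direction
of largest spread of the points, computed in closed form from the
second moments about the centroid.

```python
import math


def principal_inertia_partition(points, n_parts):
    """Split 2D points into n_parts by sorting along the principal axis.

    points: list of (x, y) tuples. Returns a list of part indices.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Second moments about the centroid.
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Direction of largest spread (closed form for a 2x2 symmetric matrix).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    dx, dy = math.cos(theta), math.sin(theta)
    # Sort by projection and cut the sorted list into equal-sized chunks.
    order = sorted(range(n), key=lambda i: (points[i][0] - cx) * dx
                                           + (points[i][1] - cy) * dy)
    part = [0] * n
    for rank, i in enumerate(order):
        part[i] = rank * n_parts // n
    return part


# A 4 x 2 grid of points, elongated in the x direction.
grid = [(float(x), float(y)) for x in range(4) for y in range(2)]
parts = principal_inertia_partition(grid, 2)
```

For this grid the principal direction is the x axis, so the two parts
are the left and the right half of the grid. Applying the same function
with `n_parts=2` recursively to each half gives the recursive variant.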

Recursive  Spectral  Bisection:  the  standard 
Recursive  Spectral  Bisection  algorithm  (see  Sec¬ 
tion  6).  The  Fiedler  vector  is  calculated  with  the 
multilevel  algorithm  of  Barnard  and  Simon  [44] 
(see  Section  6.8). 

Recursive Graph Bisection: the Recursive Graph Bisection heuristic
described in Section 3.2.4.

1D Topology Frontal Algorithm: this algorithm tries to ensure that
every subdomain has at most two neighbours. It was developed to
partition meshes on which subdomain-based multifrontal solution
schemes are used [56].

Three non-deterministic optimisation algorithms are provided to
further optimise the partitions, viz. tabu search, simulated
annealing, and stochastic evolution. Since these algorithms are very
general, it is possible in principle to optimise the partitions for
very complicated cost functions. The following functions are provided
in TOP/DOMDEC: interface size, subdomain frontwidth, the product of
interface size and subdomain frontwidth, node-wise load balance,
element-wise load balance, edge-wise load balance, subdomain aspect
ratio, or a weighted sum of the above items.

8.3.3  Availability 

To  obtain  TOP/DOMDEC,  contact 

Charbel  Farhat 
College  of  Engineering 
University  of  Colorado 
Campus  Box  429 
Boulder,  CO  80309,  U.S.A. 

E-mail: charbel@boulder.colorado.edu

Users  must  pay  a  one-time  fee,  the  amount  of 
which  depends  on  whether  the  requestor  is  a  re¬ 
search  partner,  a  research  institution,  a  US  gov¬ 
ernment  sponsored  institution,  or  an  industrial 
corporation. 

TOP/DOMDEC is written in C++. It currently runs on the SGI Iris and
the IBM RISC System/6000 with GL graphics workstations. However, if
the graphics capabilities are not required, TOP/DOMDEC can run on
other systems as well.

1 Introduction

In these notes we will present an overview of a number of related
iterative methods for the solution of linear systems of equations.
These methods are so-called Krylov projection type methods and they
include popular methods such as Conjugate Gradients, Bi-Conjugate
Gradients, CGS, Bi-CGSTAB, QMR, LSQR and GMRES. We will show how
these methods can be derived from simple basic iteration formulas. We
will not give convergence proofs, but we will refer for these, as far
as available, to the literature.

Iterative  methods  are  often  used  in  combination  with 
so-called  preconditioning  operators  (approximations 
for  the  inverses  of  the  operator  of  the  system  to  be 
solved).  Since  these  preconditioners  are  not  essential 
in  the  derivation  of  the  iterative  methods,  we  will  not 
give  much  attention  to  them  in  these  notes.  However, 
in  most  of  the  actual  iteration  schemes,  we  have  in¬ 
cluded  them  in  order  to  facilitate  the  use  of  these 
schemes  in  actual  computations. 

For  the  application  of  the  iterative  schemes  one  usu¬ 
ally  thinks  of  linear  sparse  systems,  e.g.,  like  those 
arising  in  the  finite  element  or  finite  difference  ap¬ 
proximations  of  (systems  of)  partial  differential  equa¬ 
tions.  However,  the  structure  of  the  operators  plays 
no  explicit  role  in  any  of  these  schemes,  and  these 
schemes  might  also  successfully  be  used  to  solve  cer¬ 
tain  large  dense  linear  systems.  Depending  on  the 
situation  that  might  be  attractive  in  terms  of  num¬ 
bers  of  floating  point  operations. 

It will turn out that all of the iterative methods are parallelizable
in a straightforward manner. However, especially for computers with a
memory hierarchy (e.g., cache or vector registers), and for
distributed memory computers, the performance can often be improved
significantly through rescheduling of the operations. We will discuss
parallel implementations, and occasionally we will report on
experimental findings.

2  Direct  versus  Iterative 

1. Standard Gaussian elimination leads to fill-in, and this often
makes the method expensive. Usually large sparse matrices are related
to some grid or network. In a 3D situation this leads typically to a
bandwidth $\sim n^{2/3}$ ($= m^2$ and $m^3 = n$, $1/m$ the gridsize).

The number of flops is then typically $O(nm^4) \sim n^{7/3}$
[36, 25]. For 2D problems the bandwidth is $\sim n^{1/2}$, so that
the number of flops for a direct method then varies like $n^2$.

If one has to solve many systems with different right-hand sides,
then one has to decompose the matrix only once, after which the costs
for solving each system will vary like $n^{5/3}$ for 3D problems, and
like $n^{3/2}$ for 2D problems.


2. For symmetric positive definite systems the error reduction per
iteration step of CG is $\sim \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$,
with $\kappa = \|A\|_2 \|A^{-1}\|_2$ [14, 2, 35].

For discretized second order pde's, over grids with gridsize $1/m$,
we typically see $\kappa \sim m^2$. Hence, for 3D problems we have
that $\kappa \sim n^{2/3}$, and for 2D problems: $\kappa \sim n$. For
an error reduction of $\epsilon$ we must have that

$$\left(\frac{1 - 1/\sqrt{\kappa}}{1 + 1/\sqrt{\kappa}}\right)^{j} \approx \left(1 - \frac{2}{\sqrt{\kappa}}\right)^{j} \approx e^{-2j/\sqrt{\kappa}} < \epsilon.$$

For 3D problems we have that

$$j \sim -\frac{\log\epsilon}{2}\sqrt{\kappa} \approx -\frac{\log\epsilon}{2}\, n^{1/3},$$

whereas for 2D problems:

$$j \approx -\frac{\log\epsilon}{2}\, n^{1/2}.$$

If we assume the number of flops per iteration to be $\sim fn$ ($f$
stands for the number of nonzeros per row of the matrix and the
overhead per unknown introduced by the iterative scheme)

$\Rightarrow$ flops per reduction with $\epsilon$:

$\sim -f n^{4/3} \log\epsilon$ for 3D problems,
and $\sim -f n^{3/2} \log\epsilon$ for 2D problems.

Conclusion: If we have to solve one system at a time, then for large
$n$, or small $f$, or modest $\epsilon$:

Iterative methods may be preferable.

If we have to solve many similar systems with different right-hand
sides, and if we assume their number to be so large that the cost of
constructing the decomposition of $A$ is relatively small per system,
then it seems likely that for 2D problems direct methods may be more
efficient, whereas for 3D problems this is still doubtful, since the
flops count for a direct solution method varies like $n^{5/3}$, and
the number of flops for the iterative solver (for the model
situation) varies like $n^{4/3}$.
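The orders of magnitude in this comparison can be tabulated with a few
lines of code. The sketch below only evaluates the model estimates of
this section (banded elimination with bandwidth $n^{2/3}$ in 3D and
$n^{1/2}$ in 2D, and CG with $\sqrt{\kappa} \sim m$); all constants are
set to one, and the function names and the sample values of $n$, $f$
and $\epsilon$ are assumptions for illustration.

```python
import math


def bandwidth(n, dim):
    """Model bandwidth: n**(2/3) for a 3D grid, n**(1/2) for a 2D grid."""
    return n ** (2.0 / 3.0) if dim == 3 else n ** 0.5


def direct_factor_flops(n, dim):
    """Flops to factor the banded matrix: n * bandwidth**2."""
    return n * bandwidth(n, dim) ** 2


def direct_solve_flops(n, dim):
    """Flops per right-hand side once the factors exist: n * bandwidth."""
    return n * bandwidth(n, dim)


def cg_flops(n, dim, f, eps):
    """Flops for CG to reduce the error by eps, with sqrt(kappa) ~ n**(1/dim)."""
    sqrt_kappa = n ** (1.0 / dim)
    iterations = -0.5 * math.log(eps) * sqrt_kappa
    return f * n * iterations


n, f, eps = 10 ** 6, 10, 1e-6
for dim in (2, 3):
    print(dim, direct_factor_flops(n, dim), direct_solve_flops(n, dim),
          cg_flops(n, dim, f, eps))
```

With these sample values, CG is far cheaper than a single
factorisation in 3D, while in 2D the repeated direct solves are cheaper
than CG once the factors are available, in line with the conclusion
above.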

Example 

The  above  given  arguments  are  quite  nicely  illus¬ 
trated  by  observations  made  by  Horst  Simon  [74].  He 
expects  that  by  the  end  of  this  century  we  will  have  to 
solve  repeatedly  linear  problems  with  some  5x10^  un¬ 
knowns.  For  what  he  believes  to  be  a  model  problem 
at that time, he has estimated the CPU time required by the most
economic direct method available at present as 520,040 years,
provided that the computation can be carried out at a speed of
1 TFLOPS. On the other hand, he estimates the CPU time for
preconditioned conjugate gradients, assuming still a processing speed
of 1 TFLOPS, as 576 seconds. Though
we  should  not  take  it  for  granted  that  in  particular 
the  preconditioning  part  can  be  carried  out  at  that 
high  processing  speed  (for  the  direct  solver  this  is 
more  likely),  we  see  that  the  differences  in  CPU  time 
requirements  are  gigantic,  indeed  (we  will  come  to 
this  point  in  more  detail). 

Also  the  requirements  for  memory  space  for  the  iter¬ 
ative  methods  are  typically  smaller  by  orders  of  mag¬ 
nitude.  This  is  often  the  argument  for  the  usage  of 
iterative  methods  in  2D  situations,  when  flop  counts 
for  both  classes  of  methods  are  more  or  less  compa¬ 
rable. 

Remarks: 

•  With suitable preconditioning we may have
$\sqrt{\kappa} \sim n^{1/6}$ and the flops count then becomes

$\sim -f n^{7/6} \log\epsilon$,

see, e.g., [37].

•  For classes of problems some methods may even be faster:
multigrid, fast Poisson solvers.

•  Storage considerations are also in favour of iterative methods.

•  For matrices that are not positive definite symmetric the
situation can be more problematic: it is often difficult to find the
proper iterative method or a suitable preconditioner. However, for
projection type methods, like GMRES, Bi-CG, CGS, and Bi-CGSTAB we
often see that the flops counts vary as for CG.

•  Iterative methods can be attractive even when the matrix is dense.
Again, in the positive definite symmetric case, if the condition
number is $n^{2-2\delta}$ (for some $\delta > 0$) then, since the
amount of work per iteration step is $\sim n^2$, and the number of
iteration steps $\sim n^{1-\delta}$, the total work estimate is
roughly proportional to $n^{3-\delta}$, and this is asymptotically
less than the amount of work for Choleski's method, which varies like
$\sim n^3$.

The question remains at the moment how well iterative methods can
take advantage of modern computer architectures. From Dongarra's
LINPACK benchmark [22] it may be concluded that the solution of a
dense linear system can (in principle) be computed with computational
speeds close to peak speeds on most computers. This is already the
case for systems of, say, order 50000 on parallel machines with as
many as 1024 processors.

In  sharp  contrast  with  the  dense  case  are  computa¬ 
tional  speeds  reported  in  [24]  for  the  preconditioned 
as  well  as  the  unpreconditioned  conjugate  gradient 
method  (ICCG  and  CG,  respectively). 

In [24] a test problem was taken, generated by discretizing a
three-dimensional elliptic partial differential equation by the
standard 7-point central difference scheme over a three-dimensional
rectangular grid, with 100 unknowns in each direction ($m = 100$,
$n = 1,000,000$). The observed computational speeds for several
machines (1 processor in each case) are given in Table 1.

3  Basic  iteration  method 

A very basic idea, that leads to many effective iterative solvers, is
to split the matrix of a given linear system into the sum of two
matrices, one of which is a matrix that would have led to a system
that can easily be solved. The simplest splitting we can think of is
$A = I - (I - A)$. Given the linear system $Ax = b$, this splitting
leads to the well-known Richardson iteration:

$$x_{i+1} = b + (I - A)x_i = x_i + r_i.$$

Multiplication by $-A$ and adding $b$ gives

$$b - Ax_{i+1} = b - Ax_i - Ar_i$$

or

$$r_{i+1} = (I - A)r_i = (I - A)^{i+1} r_0 = P_{i+1}(A) r_0,$$

or, in terms of the error

$$A(x - x_{i+1}) = P_{i+1}(A) A (x - x_0)$$

$$\Rightarrow\quad x - x_{i+1} = P_{i+1}(A)(x - x_0).$$

In these expressions $P_{i+1}$ is a (special) polynomial of degree
$i + 1$. Note that $P_{i+1}(0) = 1$.

Table 1: Speed in Megaflops for 50 Iterations of the Iterative Techniques.

Machine                     Optimized ICCG   Scaled CG   Peak Performance
                            (Mflops)         (Mflops)    (Mflops)
NEC SX-3/22 (2.9 ns)        607              1124        2750
CRAY Y-MP C90 (4.2 ns)      444              737         952
CRAY 2 (4.1 ns)             96.0             149         500
IBM 9000 Model 820          39.6             74.6        444
IBM 9121 (15 ns)            10.6             25.4        133
DEC Vax/9000 (16 ns)        9.48             17.1        125
IBM RS/6000-550 (24 ns)     18.3             21.1        81
CONVEX C3210                15.8             19.1        50
Alliant FX2800              2.18             2.98        40
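As a small illustration of the basic iteration, the sketch below runs
the Richardson iteration $x_{i+1} = x_i + r_i$ with $x_0 = 0$ on a
2 x 2 system. The helper names and the test matrix are assumptions for
this example; the matrix is chosen such that $\|I - A\|_\infty = 0.2$,
so that the residuals, which satisfy $r_{i+1} = (I - A) r_i$, shrink by
at least a factor 5 per step.

```python
def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def richardson(A, b, iterations):
    """Basic Richardson iteration x_{i+1} = x_i + r_i, starting from x_0 = 0.

    Returns the final iterate and the max-norms of r_0, r_1, ...
    """
    x = [0.0] * len(b)
    residual_norms = []
    for _ in range(iterations):
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        residual_norms.append(max(abs(ri) for ri in r))
        x = [xi + ri for xi, ri in zip(x, r)]
    return x, residual_norms


A = [[1.0, 0.2],
     [0.1, 0.9]]
b = [1.2, 1.0]          # chosen so that the exact solution is (1, 1)
x, norms = richardson(A, b, 60)
```

For matrices whose spectrum is not clustered around 1 this plain
iteration converges slowly or not at all, which is one motivation for
the preconditioned and accelerated variants.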

Results obtained for the standard splitting can be easily generalized
to other splittings, since the more general splitting
$A = M - N = M - (M - A)$ can be rewritten as the standard splitting
$B = I - (I - B)$ for the preconditioned matrix $B = M^{-1}A$. The
theory of matrix splittings, and the analysis of the convergence of
the corresponding iterative methods, is treated in depth in [90]. We
will not discuss this aspect here, since it is not relevant at this
stage. Instead of studying the basic iterative methods we will show
how other more powerful iteration methods can be constructed as
accelerated versions of the basic iteration methods. In the context
of these accelerated methods, the matrix splittings become important
in another way, since the matrix $M$ of the splitting is often used
to precondition the given system. That is, the iterative method is
applied to, e.g., $M^{-1}Ax = M^{-1}b$. We will return to this later.


Note that one almost never computes inverses of matrices, like
$K^{-1}$, explicitly. Instead, vectors like
$\tilde{r}_i = K^{-1}b - K^{-1}Ax_i = K^{-1}(b - Ax_i)$ are usually
computed by solving $\tilde{r}_i$ from $K\tilde{r}_i = b - Ax_i$. The
matrix $K$ is often sparse, whereas $K^{-1}$ usually is not, so that
this procedure is much more efficient both in CPU-time and in
computer memory space.

From now on we will assume that $x_0 = 0$. This too does not mean a
loss of generality, for the situation $x_0 \neq 0$ can through a
simple linear transformation $z = x - x_0$ be transformed to the
system

$$Az = b - Ax_0 = \tilde{b}$$

for which obviously $z_0 = 0$.

For the simple Richardson iteration it follows that

$$x_{i+1} = r_0 + r_1 + r_2 + \cdots + r_i = \sum_{j=0}^{i} (I - A)^j r_0 \in \{r_0, Ar_0, \ldots, A^i r_0\} \equiv K^{i+1}(A; r_0).$$

Apparently, the Richardson iteration delivers elements of increasing
Krylov subspaces. Including local iteration parameters in the
iteration would lead to other elements of the same Krylov subspaces.
Let us write such an element still as $x_{i+1}$. Since
$x_{i+1} \in K^{i+1}(A; r_0)$, we have that

$$x_{i+1} = Q_i(A) r_0,$$

with $Q_i$ an arbitrary polynomial of degree $i$.

It follows that

$$r_{i+1} = b - Ax_{i+1} = (I - AQ_i(A)) r_0 = P_{i+1}(A) r_0, \qquad (3.0a)$$

with, just as in the standard Richardson iteration,
$P_{i+1}(0) = 1$.

The Richardson iteration can be characterized by the polynomial
$P_{i+1}(A) = (I - A)^{i+1}$.

4 Towards optimal iteration methods

The natural question arises whether we can pick up a better $x_{i+1}$
from the Krylov subspace that is generated by the basic iterative
method. One would like to see the $x_{i+1}$ for which
$\|x_{i+1} - x\|_2$ is minimal. E.g.,
$x_1 \in \{r_0\} \Rightarrow x_1 = \alpha_0 r_0$.

$$\|x - x_1\|_2^2 = (x - \alpha_0 r_0, x - \alpha_0 r_0) = (x, x) - 2\alpha_0 (x, r_0) + \alpha_0^2 (r_0, r_0).$$

Minimizing with respect to $\alpha_0$ gives

$$\alpha_0 = \frac{(r_0, x)}{(r_0, r_0)},$$

and this is not practical, since $x$ is unknown.

The above expression for $\alpha_0$ suggests that with a different
innerproduct the problem might be solvable:
$(x, y)_A \equiv (x, Ay)$.

5 Symmetric matrices

If $A$ is symmetric positive definite then this defines a proper
innerproduct:

$$(x, y)_A = (y, x)_A,$$

$$(x, x)_A = 0 \iff x = 0.$$

Now we have that

$$\|x - x_1\|_A^2 = (x - \alpha_0 r_0, x - \alpha_0 r_0)_A,$$

which is minimal for

$$\alpha_0 = \frac{(r_0, x)_A}{(r_0, r_0)_A} = \frac{(r_0, Ax)}{(r_0, Ar_0)}.$$

Since $Ax = b$ is known, this expression can be evaluated without
knowing $x$. This looks promising and therefore we will follow that
line.

In general we want $\|x_i - x\|_A$ minimal for
$x_i \in K^i(A; r_0)$

$$\Rightarrow\quad x_i - x \perp_A K^i(A; r_0)$$

$$\Leftrightarrow\quad r_i \perp K^i(A; r_0).$$

In particular $r_1 \in \{r_0, Ar_0\}$. Assuming that
$r_1 \neq \gamma r_0$ (it is easy to check that otherwise $r_0$ is an
eigenvector of $A$ and the process could be stopped since the exact
solution has then been obtained after only one iteration step), we
see that $\{r_0, r_1\}$ form an orthogonal basis for $K^2(A; r_0)$.

By an induction argument we conclude that when the process does not
find the exact solution at or before step $i$ then

$$\{r_0, r_1, \ldots, r_i\}$$

is an orthogonal basis for $K^{i+1}(A; r_0)$.

This leads to the idea to construct an orthogonal basis for the Krylov subspace, a basis of which is generated implicitly by the standard iteration anyway, and then to project $x_i - x$, with respect to the $A$-inner product, onto the Krylov subspace and to determine $x_i$ from that.

We have seen that the $r_j$ form an orthogonal basis for $K^i(A; r_0)$, but the next remarkable property is that they satisfy a 3-term recurrence relation:

(5.0a)  $\alpha_{j+1} r_{j+1} = A r_j - \beta_j r_j - \gamma_j r_{j-1}.$

The proof is as follows.

$$r_1 \in K^2(A; r_0) \;\Rightarrow\; \alpha_1 r_1 = A r_0 - \beta_0 r_0,$$

$$r_2 \in K^3(A; r_0) \;\Rightarrow\; r_2 \in \{r_0, r_1, A^2 r_0\} \;\Rightarrow\; r_2 \in \{r_0, r_1, A r_1\} \;\Rightarrow\; \alpha_2 r_2 = A r_1 - \beta_1 r_1 - \gamma_1 r_0.$$

Now we use an induction argument.

$$r_{j-1} \in K^j(A; r_0) \;\Rightarrow\; A r_{j-1} \in K^{j+1}(A; r_0) = \{r_0, r_1, \ldots, r_j\} \;\Rightarrow\; \alpha_j r_j = A r_{j-1} - \sum_{i=0}^{j-1} \delta_i r_i.$$

Because we want the new vector $r_j$ to be orthogonal with respect to all previous ones, the constants $\delta_k$ are determined by

$$(A r_{j-1}, r_k) - \delta_k (r_k, r_k) = 0,$$

(5.0b)  $(A r_{j-1}, r_k) = (r_{j-1}, A r_k)$

(note that we have used the symmetry of $A$)

$$= (r_{j-1},\, \alpha_{k+1} r_{k+1} + \beta_k r_k + \gamma_k r_{k-1}).$$

Here we have used the induction argument for $k$. Because of the orthogonality it follows that $\delta_k = 0$ for $k = 0, \ldots, j-3$ and hence $r_j$ also satisfies a 3-term recurrence relation.

The values for $\beta_j$ and $\gamma_j$ follow from the orthogonality of the residual vectors:

$$\beta_j = (r_j, A r_j)/(r_j, r_j),$$

and

$$\gamma_j = (r_{j-1}, A r_j)/(r_{j-1}, r_{j-1}).$$

The value of $\alpha_{j+1}$ determines the proper length of the new residual vector. From the consistency relation (3.0a) we have that each residual can be written as $r_0$ plus powers of $A$ times $r_0$. Comparing the coefficient for $r_0$ in the recurrence relation (5.0a) shows that

$$\alpha_{j+1} + \beta_j + \gamma_j = 0.$$

At the end of this section we will consider the situation where the recurrence relation terminates.

We can view this 3-term recurrence relation slightly differently as

$$A r_j = \gamma_j r_{j-1} + \beta_j r_j + \alpha_{j+1} r_{j+1}.$$

If we consider the $r_j$ as being the $j$-th column of the matrix

$$R_i = (r_0, \ldots, r_{i-1}),$$

then the recurrence relation says that $A$ applied to a column of $R_i$ results in the combination of three successive columns, or

(5.0c)  $A R_i = R_i T_i + \alpha_i r_i e_i^T,$

in which $T_i$ is an $i$ by $i$ tridiagonal matrix and $e_i$ is the $i$th canonical vector in $\mathbb{R}^i$.

Since we are looking for a solution $x_i$ in $K^i(A; r_0)$, that vector can be written as a combination of the basis vectors of the Krylov subspace, and hence

$$x_i = R_i y.$$

(Note that $y$ has $i$ components.)

Further we have for the $x_i$, for which the error in $A$-norm is minimal, that

$$R_i^T (A x_i - b) = 0 \;\Rightarrow\; R_i^T A R_i y - R_i^T b = 0.$$

Using equation (5.0c) and the fact that $r_i$ is orthogonal with respect to the columns of $R_i$ we obtain

$$R_i^T R_i T_i y = \|r_0\|_2^2\, e_1.$$

Since $R_i^T R_i$ is a diagonal matrix with diagonal elements $\|r_0\|_2^2$ up to $\|r_{i-1}\|_2^2$ we find the desired solution from

$$T_i y = e_1 \;\Rightarrow\; y \;\Rightarrow\; x_i = R_i y.$$

Note  that  so  far  we  have  only  used  the  fact  that  A 
is  symmetric  and  we  have  assumed  that  the  matrix 
Ti  is  not  singular.  We  will  see  later  that  this  opens 
the  possibility  for  several  suitable  iterative  methods, 
among  which  the  conjugate  gradients  method.  The 
Krylov  subspace  method  that  has  been  derived  here  is 
known  as  the  Lanczos  method  for  symmetric  systems 
[47].  We  will  exploit  the  relation  between  the  Lanczos 
method  and  the  conjugate  gradients  method  for  the 
analysis  of  the  convergence  behaviour  of  the  latter 
method. 

Note that for some $j \le n - 1$ the construction of the orthogonal basis must terminate. In that case we have that $A R_{j+1} = R_{j+1} T_{j+1}$. Let $y$ be the solution of the reduced system $T_{j+1} y = e_1$, and $x_{j+1} = R_{j+1} y$. Then it follows that $x_{j+1} = x$, i.e., we have arrived at the exact solution, since $A x_{j+1} - b = A R_{j+1} y - b = R_{j+1} T_{j+1} y - b = R_{j+1} e_1 - b = 0$ (we have assumed that $x_0 = 0$).
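The construction of this section can be tried out numerically. The following is a minimal sketch, assuming $x_0 = 0$ and a small dense symmetric positive definite test matrix; the function name `residual_basis` and the test problem are illustrative choices, not part of the original notes. It builds the residual vectors with the 3-term recurrence (5.0a), using $\alpha_{j+1} = -(\beta_j + \gamma_j)$, stores the coefficients in $T_i$, and recovers $x_i = R_i y$ from $T_i y = e_1$.

```python
import numpy as np


def residual_basis(A, b, steps):
    """Residual vectors r_0..r_{steps-1} and tridiagonal T from recurrence (5.0a)."""
    n = b.shape[0]
    R = np.zeros((n, steps))
    T = np.zeros((steps, steps))
    r_old = np.zeros(n)
    r = b.astype(float).copy()            # r_0 = b because x_0 = 0
    for j in range(steps):
        R[:, j] = r
        Ar = A @ r
        beta = (r @ Ar) / (r @ r)
        gamma = (r_old @ Ar) / (r_old @ r_old) if j > 0 else 0.0
        alpha_next = -(beta + gamma)      # from alpha_{j+1} + beta_j + gamma_j = 0
        # Column j of T holds the coefficients of A r_j in the residual basis.
        T[j, j] = beta
        if j > 0:
            T[j - 1, j] = gamma
        if j + 1 < steps:
            T[j + 1, j] = alpha_next
            r_old, r = r, (Ar - beta * r - gamma * r_old) / alpha_next
    return R, T


# Illustrative test problem: eigenvalues 1..6, equal weight on every eigenvector.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = Q @ np.diag(np.arange(1.0, 7.0)) @ Q.T
b = Q.sum(axis=1)

R, T = residual_basis(A, b, 6)
e1 = np.zeros(6)
e1[0] = 1.0
x = R @ np.linalg.solve(T, e1)            # x_i = R_i y with T_i y = e_1

R_unit = R / np.linalg.norm(R, axis=0)    # normalised residuals
gram = R_unit.T @ R_unit                  # should be close to the identity
```

With `steps` equal to the order of the matrix the basis is complete, so `x` should agree with the exact solution up to rounding, and `gram` shows the orthogonality of the residuals.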

5.1 THE CG-METHOD:

The Conjugate Gradients (CG) method [41] is merely a variant on the above approach, which saves storage and computational effort. For, when solving the projected equations in the above way, we see that we have to save all columns of $R_i$ throughout the process in order to recover the current iteration vectors $x_i$. This can be done cheaper. If we assume that the matrix $A$ is in addition positive definite then, because of the relation

$$R_i^T A R_i = R_i^T R_i T_i,$$

we conclude that $T_i$ can be transformed by a row scaling matrix $R_i^T R_i$ into a positive definite symmetric tridiagonal matrix (note that $R_i^T A R_i$ is positive definite for $y \in \mathbb{R}^i$). This implies that $T_i$ can be LU decomposed without any pivoting:

$$T_i = L_i U_i,$$

with $L_i$ lower unit bidiagonal and $U_i$ upper bidiagonal. Hence

$$x_i = R_i y = R_i T_i^{-1} e_1 = (R_i U_i^{-1})(L_i^{-1} e_1).$$

We concentrate on the factors, placed between parentheses, separately.

1. With $q = L_i^{-1} e_1$ we have that $q$ can be solved from $L_i q = e_1 \Rightarrow \ell_{i-1} q_{i-2} + q_{i-1} = 0 \Rightarrow q_{i-1}$, in recursive manner (here $\ell_{i-1}$ denotes the subdiagonal element in the last row of $L_i$).

2. Write $B_i = R_i U_i^{-1}$, then we have that

$$R_i = B_i U_i = B_i \begin{pmatrix} d_0 & g_1 & & \\ & d_1 & \ddots & \\ & & \ddots & g_{i-1} \\ & & & d_{i-1} \end{pmatrix}$$

$$\Rightarrow\; r_{i-1} = g_{i-1} (B_i)_{i-2} + d_{i-1} (B_i)_{i-1} \;\Rightarrow\; (B_i)_{i-1}.$$

Glueing these two recurrences together we obtain

$$x_i = x_{i-1} + q_{i-1} (B_i)_{i-1},$$

and this is in fact the well-known conjugate gradients method. The name stems from the property that the update vectors $(B_i)_{i-1}$, usually notated as $p_{i-1}$, are $A$-orthogonal.

Note that the positive definiteness of $A$ is only exploited to guarantee the flawless decomposition of the implicitly generated tridiagonal matrix $T_i$. This suggests that the conjugate gradients method may also work for certain non positive definite systems, but then at our own risk [59]. We will later see how other ways of solving the projected system will lead to other well-known methods.

5.1.1  Computational  notes 

The standard (unpreconditioned) Conjugate Gradient algorithm for the solution of $Ax = b$ can be represented by the following scheme:

    x_0 = initial guess; r_0 = b - A x_0;
    p_{-1} = 0; beta_{-1} = 0;
    rho_0 = (r_0, r_0)
    for i = 0, 1, 2, ....
        p_i = r_i + beta_{i-1} p_{i-1};
        q_i = A p_i;
        alpha_i = rho_i / (p_i, q_i);
        x_{i+1} = x_i + alpha_i p_i;
        r_{i+1} = r_i - alpha_i q_i;
        if x_{i+1} accurate enough then quit;
        rho_{i+1} = (r_{i+1}, r_{i+1});
        beta_i = rho_{i+1} / rho_i;
    end;

CG is most often used in combination with a suitable splitting $A = K - R$, and then $K^{-1}$ is called the preconditioner. We will assume that $K$ is also positive definite.

Note first that the CG method can be derived for any choice of the inner product. In our derivation we have used the standard inner product $(x, y) = \sum_i x_i y_i$, but we have not used any specific property of that inner product. Now we make a different choice:

$$[x, y] = (x, Ky).$$

It is easy to verify that $K^{-1}A$ is symmetric positive definite with respect to $[\ ,\ ]$:

(5.1a)  $[K^{-1}Ax, y] = (K^{-1}Ax, Ky) = (Ax, y) = (x, Ay) = [x, K^{-1}Ay].$

Hence, we can follow our CG procedure for solving the preconditioned system $K^{-1}Ax = K^{-1}b$, using the new $[\ ,\ ]$-inner product.

Apparently, we now are minimizing

$$[x_i - x,\, K^{-1}A(x_i - x)] = (x_i - x,\, A(x_i - x)),$$

which leads to the remarkable (and known) result that for this preconditioned system we still minimize the error in $A$-norm, but now over a Krylov subspace generated by $K^{-1}r_0$ and $K^{-1}A$.

In the following computational scheme for preconditioned CG, for the solution of $Ax = b$ with preconditioner $K^{-1}$, we have replaced the $[\ ,\ ]$-inner product again by the familiar standard inner product. E.g., note that with $\tilde{r}_{i+1} = K^{-1}Ax_{i+1} - K^{-1}b$ we have that

$$\rho_{i+1} = [\tilde{r}_{i+1}, \tilde{r}_{i+1}] = (r_{i+1}, K^{-1}r_{i+1}),$$

and $\tilde{r}_{i+1}$ is the residual corresponding to the preconditioned system $K^{-1}Ax = K^{-1}b$.

    x_0 = initial guess; r_0 = b - A x_0;
    p_{-1} = 0; beta_{-1} = 0;
    Solve w_0 from K w_0 = r_0;
    rho_0 = (r_0, w_0)
    for i = 0, 1, 2, ....
        p_i = w_i + beta_{i-1} p_{i-1};
        q_i = A p_i;
        alpha_i = rho_i / (p_i, q_i);
        x_{i+1} = x_i + alpha_i p_i;
        r_{i+1} = r_i - alpha_i q_i;
        if x_{i+1} accurate enough then quit;
        Solve w_{i+1} from K w_{i+1} = r_{i+1};
        rho_{i+1} = (r_{i+1}, w_{i+1});
        beta_i = rho_{i+1} / rho_i;
    end;
Note that this formulation, which is quite popular, has the advantage that the preconditioner need not be split into two factors, and that one avoids transforming solutions and residuals back, as is necessary when one applies CG to $L^{-1} A L^{-T} y = L^{-1} b$.
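The preconditioned scheme translates almost line by line into code. The following is a minimal sketch under stated assumptions: the helper names `laplacian_2d` and `pcg`, the relative residual stopping test (the notes only say "accurate enough"), and the Jacobi preconditioner used in the example are all illustrative choices.

```python
import numpy as np


def laplacian_2d(m):
    """5-point Laplacian on an m-by-m grid (block tridiagonal, order m*m)."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))


def pcg(A, b, solve_K, tol=1e-10, maxiter=500):
    """Preconditioned CG; solve_K(r) must return w with K w = r."""
    x = np.zeros_like(b)                  # x_0 = 0
    r = b - A @ x
    w = solve_K(r)
    rho = r @ w
    p = np.zeros_like(b)
    beta = 0.0
    for i in range(maxiter):
        p = w + beta * p
        q = A @ p
        alpha = rho / (p @ q)             # first inner product (synchronization)
        x = x + alpha * p
        r = r - alpha * q
        # Assumed stopping test: relative residual norm.
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, i + 1
        w = solve_K(r)
        rho_next = r @ w                  # second inner product (synchronization)
        beta = rho_next / rho
        rho = rho_next
    return x, maxiter


A = laplacian_2d(8)
b = np.ones(A.shape[0])
diag_A = np.diag(A).copy()
x, iterations = pcg(A, b, lambda r: r / diag_A)   # Jacobi: K = diag(A)
```

Only the action of $K^{-1}$ on a vector is needed, which is why `solve_K` is passed as a function rather than as a factored matrix.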

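The coefficients $\alpha_j$ and $\beta_j$, generated by the above scheme, can be used to build the matrix $T_i$ in the following way:

$$T_i = \begin{pmatrix}
\frac{1}{\alpha_0} & -\frac{\beta_0}{\alpha_0} & & \\
-\frac{1}{\alpha_0} & \frac{1}{\alpha_1} + \frac{\beta_0}{\alpha_0} & -\frac{\beta_1}{\alpha_1} & \\
 & \ddots & \ddots & \ddots \\
 & & -\frac{1}{\alpha_{i-2}} & \frac{1}{\alpha_{i-1}} + \frac{\beta_{i-2}}{\alpha_{i-2}}
\end{pmatrix}.$$

Since $\alpha_j > 0$ and $\beta_j > 0$ we see that the above matrix is similar to the following symmetric tridiagonal matrix:

$$\tilde{T}_i = \begin{pmatrix}
\frac{1}{\alpha_0} & -\frac{\sqrt{\beta_0}}{\alpha_0} & & \\
-\frac{\sqrt{\beta_0}}{\alpha_0} & \frac{1}{\alpha_1} + \frac{\beta_0}{\alpha_0} & -\frac{\sqrt{\beta_1}}{\alpha_1} & \\
 & \ddots & \ddots & \ddots \\
 & & -\frac{\sqrt{\beta_{i-2}}}{\alpha_{i-2}} & \frac{1}{\alpha_{i-1}} + \frac{\beta_{i-2}}{\alpha_{i-2}}
\end{pmatrix}.$$

The eigenvalues of the leading $i$-th order minor of this matrix are the Ritz values of the preconditioned matrix $K^{-1}A$ with respect to the $i$-dimensional Krylov subspace spanned by the first $i$ residual vectors. The Ritz values approximate the (extremal) eigenvalues of the preconditioned matrix increasingly well. These approximations can be used to get an impression of the relevant eigenvalues. They can also be used to construct upper bounds for the error in the delivered approximation with respect to the solution [45, 40]. According to the results in [80] the eigenvalue information can also be used in order to understand or explain delays in the convergence behaviour.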
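As an illustration of how Ritz values can be obtained as a by-product of CG, here is a small sketch for the unpreconditioned case ($K = I$, $x_0 = 0$). It assumes the symmetric tridiagonal matrix with diagonal entries $1/\alpha_j + \beta_{j-1}/\alpha_{j-1}$ (with the second term absent for $j = 0$) and off-diagonal entries $-\sqrt{\beta_j}/\alpha_j$, which follows from eliminating $p_j$ from the CG recurrences; the function names and the test matrix are hypothetical.

```python
import numpy as np


def cg_coefficients(A, b, steps):
    """Run `steps` CG iterations (K = I, x_0 = 0); return the alpha_j and beta_j."""
    r = b.astype(float).copy()
    p = np.zeros_like(r)
    rho, beta = r @ r, 0.0
    alphas, betas = [], []
    for _ in range(steps):
        p = r + beta * p
        q = A @ p
        alpha = rho / (p @ q)
        r = r - alpha * q
        rho_next = r @ r
        beta = rho_next / rho
        rho = rho_next
        alphas.append(alpha)
        betas.append(beta)
    return alphas, betas


def symmetric_tridiagonal(alphas, betas):
    """Symmetric tridiagonal matrix assembled from the CG coefficients."""
    i = len(alphas)
    T = np.zeros((i, i))
    for j in range(i):
        T[j, j] = 1.0 / alphas[j]
        if j > 0:
            T[j, j] += betas[j - 1] / alphas[j - 1]
        if j + 1 < i:
            T[j, j + 1] = T[j + 1, j] = -np.sqrt(betas[j]) / alphas[j]
    return T


# Illustrative test matrix with known eigenvalues 1, 2, ..., 6.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
eigenvalues = np.arange(1.0, 7.0)
A = Q @ np.diag(eigenvalues) @ Q.T
b = Q.sum(axis=1)                         # equal weight on every eigenvector

ritz_3 = np.linalg.eigvalsh(symmetric_tridiagonal(*cg_coefficients(A, b, 3)))
ritz_6 = np.linalg.eigvalsh(symmetric_tridiagonal(*cg_coefficients(A, b, 6)))
```

After three steps the Ritz values lie inside the spectrum of $A$; after six steps (the order of the matrix) they should coincide with the eigenvalues up to rounding.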

5.1.2 The convergence of Conjugate Gradients

The conjugate gradient method (here with $K = I$) constructs in the $i$-th iteration step an $x_i$, which can be written as

$$x_i - x = P_i(A)(x_0 - x) \quad \text{(cf. (3.0a))},$$

such that $\|x_i - x\|_A$ is minimal over all polynomials $P_i$ of degree $i$, with $P_i(0) = 1$.

Let us denote the eigenvalues and the orthonormalized eigenvectors of $A$ by $\lambda_j$, $z_j$. We write $r_0 = \sum_j \gamma_j z_j$. It follows that

$$r_i = P_i(A) r_0 = \sum_j \gamma_j P_i(\lambda_j) z_j$$

and hence

$$\|x_i - x\|_A^2 = \sum_j \frac{\gamma_j^2}{\lambda_j} P_i^2(\lambda_j).$$

Note that only those $\lambda_j$ play a role in this process for which $\gamma_j \neq 0$. In particular, if $A$ happens to be semidefinite, i.e., there is a $\lambda = 0$, then this is no problem for the minimization process as long as the corresponding coefficient $\gamma$ is zero as well. The situation where $\gamma$ is small, due to rounding errors, is discussed in [45].

Upper bounds on the error (in $A$-norm) are obtained by observing that

(5.1c)  $\|x_i - x\|_A^2 = \sum_j \frac{\gamma_j^2}{\lambda_j} P_i^2(\lambda_j) \le \sum_j \frac{\gamma_j^2}{\lambda_j} Q_i^2(\lambda_j) \le \max_{\lambda_j} Q_i^2(\lambda_j) \sum_j \frac{\gamma_j^2}{\lambda_j},$

for any arbitrary polynomial $Q_i$ of degree $i$ with $Q_i(0) = 1$, where the maximum is taken, of course, only over those $\lambda$ for which the corresponding $\gamma \neq 0$.

When $P_i$ has zeros at all the different $\lambda_j$ then $r_i = 0$. The conjugate gradients method tries to spread the zeros in such a way that $P_i(\lambda_j)$ is small in a weighted sense, i.e., $\|x_i - x\|_A$ is as small as possible.

We get suitable upper bounds by selecting appropriate polynomials for $Q_i$. A very well-known upper bound arises by taking for $Q_i$ the $i$-th degree Chebychev polynomial transformed to the interval $[\lambda_{\min}, \lambda_{\max}]$ and scaled such that its value in 0 is equal to 1. Since $\sum_j \gamma_j^2/\lambda_j = \|x_0 - x\|_A^2$, this gives

(5.1d)  $\|x_i - x\|_A \le \max_{\lambda_j} |Q_i(\lambda_j)| \, \|x_0 - x\|_A$

and, with $\kappa = \lambda_{\max}/\lambda_{\min}$, for all $\lambda$ in $[\lambda_{\min}, \lambda_{\max}]$,

(5.1e)  $|Q_i(\lambda)| \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^i.$

The purpose of preconditioning is to reduce the condition number $\kappa$.
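The Chebychev-based bound can be compared with the actual error of CG on a small problem. The sketch below is illustrative only: it assumes $K = I$, $x_0 = 0$, a 1D Laplacian as test matrix, and a random right-hand side, none of which come from the notes. It records $\|x_i - x\|_A$ for a number of steps and evaluates $2\left((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\right)^i \|x_0 - x\|_A$ next to it.

```python
import numpy as np


def cg_a_norm_errors(A, b, steps):
    """A-norm of the error after 0, 1, ..., steps CG iterations (x_0 = 0)."""
    x_exact = np.linalg.solve(A, b)       # reference solution, for measuring only
    x = np.zeros_like(b)
    r = b.copy()
    p = np.zeros_like(b)
    rho, beta = r @ r, 0.0
    errors = []
    for _ in range(steps + 1):
        e = x - x_exact
        errors.append(np.sqrt(e @ (A @ e)))
        p = r + beta * p
        q = A @ p
        alpha = rho / (p @ q)
        x = x + alpha * p
        r = r - alpha * q
        rho_next = r @ r
        beta = rho_next / rho
        rho = rho_next
    return np.array(errors)


n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian, SPD
b = np.random.default_rng(1).standard_normal(n)
steps = 20

errors = cg_a_norm_errors(A, b, steps)
lam = np.linalg.eigvalsh(A)
kappa = lam[-1] / lam[0]                  # condition number of A
factor = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
bounds = 2.0 * factor ** np.arange(steps + 1) * errors[0]
```

In experiments of this kind the observed errors typically fall well below the bound, which is consistent with the discussion of superlinear convergence: the bound uses only $\lambda_{\min}$ and $\lambda_{\max}$ and ignores how the remaining eigenvalues are distributed.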


As we have seen the conjugate gradients algorithm is just an efficient implementation of the Lanczos algorithm. The eigenvalues of the implicitly generated tridiagonal matrix $T_i$ are the Ritz values of $A$ with respect to the current Krylov subspace. It is known from Lanczos theory that these Ritz values converge towards the eigenvalues of $A$ and that in general the extremal eigenvalues of $A$ are well approximated first [46, 58, 63]. Furthermore, the speed of convergence depends on how well these eigenvalues are separated from the others (gap ratio) [63]. This helps us to understand the so-called superlinear convergence behaviour of the conjugate gradient method (as well as other Krylov subspace methods). It can be shown that as soon as one of the extremal eigenvalues is modestly well approximated by a Ritz value, the procedure converges from then on as a process in which this eigenvalue is absent, i.e., a process with a reduced condition number. Note that superlinear convergence behaviour in this connection is used to indicate linear convergence with a factor that is gradually decreased during the process as more and more of the extremal eigenvalues are sufficiently well approximated (for details on this see [80]).

5.1.3  Further  references 

A more formal presentation of CG, as well as many theoretical properties, can be found in the textbook by Hackbusch [39]. A shorter presentation is given in [35]. An overview of papers, published in the first 25 years of existence of the method, is given in [34]. Vector processing and parallel computing aspects are discussed in [23] and [57].

5.2 MINRES and SYMMLQ:

When $A$ is not positive definite, but still symmetric, then we can construct an orthogonal basis for the Krylov subspace, as we have seen before. We write the recurrence relations slightly differently as

$$A R_i = R_{i+1} \bar{T}_i,$$

with $\bar{T}_i$ an $(i+1)$ by $i$ tridiagonal matrix.

In this case we have the problem that $(\ ,\ )_A$ does not define an inner product. However we can still try to minimize the residual. We look for an

$$x_i \in \{r_0, A r_0, \ldots, A^{i-1} r_0\}, \qquad x_i = R_i y,$$

for which

$$\|A x_i - b\|_2 = \|A R_i y - b\|_2 = \|R_{i+1} \bar{T}_i y - b\|_2$$

is minimal. Now we exploit the fact that $R_{i+1} D_{i+1}^{-1}$, with $D_{i+1} = \mathrm{diag}(\|r_0\|_2, \|r_1\|_2, \ldots, \|r_i\|_2)$, is an orthonormal transformation with respect to the current Krylov subspace:

$$\|A x_i - b\|_2 = \|D_{i+1} \bar{T}_i y - \|r_0\|_2 e_1\|_2,$$

and this final expression can simply be seen as a minimum norm least squares problem.

The element in the $(i+1, i)$ position of $\bar{T}_i$ can be transformed to zero by a simple Givens rotation and the resulting upper triangular system, which has at most three nonzero diagonals (the other subdiagonal elements being removed in previous iteration steps), can simply be solved, which leads to the so-called MINRES method [60].

Another possibility is to solve the system $T_i y = \|r_0\|_2 e_1$, as in the CG method ($T_i$ is the upper $i$ by $i$ part of $\bar{T}_i$). Other than in CG we cannot rely on the existence of a Choleski decomposition (since $A$ is not positive definite). An alternative is then to decompose $T_i$ by an LQ-decomposition. This again leads to simple recurrences and the resulting method is known as SYMMLQ [60].

5.3 Parallelism and data locality in preconditioned CG:

For successful application of CG one needs that the matrix $A$ is symmetric positive definite. In other short recurrence methods, other properties of $A$ may be desirable, but we will not exploit these properties explicitly in the discussion on parallel aspects.

Most often, the conjugate gradients method is used in combination with some kind of preconditioning. This means that the matrix $A$ can be thought of as being multiplied with some suitable approximation $K^{-1}$ for $A^{-1}$. Usually, $K$ is constructed as an approximation of $A$, such that systems like $Ky = z$ are much easier to solve than $Ax = b$. Unfortunately, a popular class of preconditioners, based upon incomplete factorization of $A$, does not lend itself well to parallel implementation. We will discuss some approaches to obtain more parallelism in the preconditioner in section 9.1. At the moment we will assume that the preconditioner is chosen such that the parallelism in solving $Ky = z$ is comparable with the parallelism in computing $Ap$, for given $p$.

For CG it is also required that the preconditioner $K$ is symmetric positive definite. This aspect will play a role in our discussions since it shows how some properties of the preconditioner can sometimes be used to our advantage for an efficient implementation.


The scheme for preconditioned CG is given in Section 5.1.1. Note that in that scheme the updating of $x$ and $r$ can only start after the completion of the inner product required for $\alpha_i$. Therefore, this inner product is a so-called synchronization point: all computation has to wait for completion of this operation. One can try to avoid such synchronization points as much as possible, or to formulate CG in such a way that synchronization points can be taken together. We will see such approaches further on.

Since on a distributed memory machine communication is required to assemble the inner product, it would be nice if we could proceed with other useful computation while the communication takes place. However, as we see from our CG scheme, there is no possibility to overlap this communication time with useful computation. The same observation can be made for the updating of $p$, which can only take place after the completion of the inner product for $\rho_i$. Apart from the computation of $Ap$ and the computations with $K$, we need to load 7 vectors for 10 vector floating point operations. This means that for this part of the computation only 10/7 floating point operations can be carried out per memory reference on average.

Several authors ([11, 52, 53]) have attempted to improve this ratio, and to reduce the number of synchronization points. In our formulation of CG there are two such synchronization points, namely the computation of both inner products.

Meurant [52] (see also [68]) has proposed a variant in which there is only one synchronization point, however at the cost of a possibly reduced numerical stability, and one additional inner product. In this scheme the ratio between computations and memory references is about 2.

We show here another variant, proposed by Chronopoulos and Gear [11].

    x_0 = initial guess; r_0 = b - A x_0;
    q_{-1} = p_{-1} = 0; beta_{-1} = 0;
    Solve w_0 from K w_0 = r_0;
    s_0 = A w_0;
    rho_0 = (r_0, w_0); mu_0 = (s_0, w_0);
    alpha_0 = rho_0 / mu_0;
    for i = 0, 1, 2, ....
        p_i = w_i + beta_{i-1} p_{i-1};
        q_i = s_i + beta_{i-1} q_{i-1};
        x_{i+1} = x_i + alpha_i p_i;
        r_{i+1} = r_i - alpha_i q_i;
        if x_{i+1} accurate enough then quit;
        Solve w_{i+1} from K w_{i+1} = r_{i+1};
        s_{i+1} = A w_{i+1};
        rho_{i+1} = (r_{i+1}, w_{i+1});
        mu_{i+1} = (s_{i+1}, w_{i+1});
        beta_i = rho_{i+1} / rho_i;
        alpha_{i+1} = rho_{i+1} / (mu_{i+1} - rho_{i+1} beta_i / alpha_i);
    end i;

In this scheme all vectors need only be loaded once per pass of the loop, which leads to a better exploitation of the data (improved data locality). However, the price is that we need $2n$ flops more per iteration step. Chronopoulos and Gear [11] claim stability, based upon their numerical experiments.

Instead of 2 synchronization points, as in the standard version of CG, we now have only one synchronization point, as the next loop can only be started when the inner products at the end of the previous loop have been assembled. Another slight advantage is that these inner products can be computed in parallel.
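A serial sketch of this variant is given below, to make the single synchronization point visible. It is an illustration under assumptions: the names `laplacian_2d` and `cg_chronopoulos_gear`, the Jacobi preconditioner, and the relative residual stopping test are not from the notes, and the recurrence used for $\alpha_{i+1}$ is $\rho_{i+1}/(\mu_{i+1} - \rho_{i+1}\beta_i/\alpha_i)$, which is algebraically equal to $\rho_{i+1}/(p_{i+1}, A p_{i+1})$ in exact arithmetic.

```python
import numpy as np


def laplacian_2d(m):
    """5-point Laplacian on an m-by-m grid (block tridiagonal, order m*m)."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))


def cg_chronopoulos_gear(A, b, solve_K, tol=1e-10, maxiter=500):
    """CG variant with both inner products grouped at the end of the loop body."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = np.zeros_like(b)
    q = np.zeros_like(b)
    beta = 0.0
    w = solve_K(r)
    s = A @ w
    rho, mu = r @ w, s @ w
    alpha = rho / mu
    for i in range(maxiter):
        p = w + beta * p
        q = s + beta * q                  # q_i = A p_i, without a product with p_i
        x = x + alpha * p
        r = r - alpha * q
        # Assumed stopping test; in a parallel code this norm would be
        # assembled together with the two inner products below.
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, i + 1
        w = solve_K(r)
        s = A @ w
        rho_next = r @ w                  # the two inner products are independent
        mu = s @ w                        # and can be assembled in one reduction
        beta = rho_next / rho
        alpha = rho_next / (mu - rho_next * beta / alpha)
        rho = rho_next
    return x, maxiter


A = laplacian_2d(8)
b = np.ones(A.shape[0])
diag_A = np.diag(A).copy()
x, iterations = cg_chronopoulos_gear(A, b, lambda r: r / diag_A)
```

The extra vector update for `q` accounts for the $2n$ additional flops per iteration step mentioned above.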

Chronopoulos and Gear [11] propose to further improve the data locality and parallelism in CG by combining $s$ successive steps. Their algorithm is based upon the following property of CG. The residual vectors $r_0, \ldots, r_{i-1}$ form an orthogonal basis (assuming exact arithmetic) for the Krylov subspace spanned by $r_0, A r_0, \ldots, A^{i-1} r_0$. When arrived at $r_j$, the vectors $r_0, r_1, \ldots, r_j, A r_j, \ldots, A^{i-j-1} r_j$ also form a basis for this subspace. Hence, we may combine $s$ successive steps by generating $r_j, A r_j, \ldots, A^{s-1} r_j$ first, and then do the orthogonalization and the updating of the current solution with this blockwise extended subspace. This approach leads to a slight increase in flops in comparison with $s$ successive steps of the standard CG, and also one additional matrix vector product is required per $s$ steps.

The main drawback in this approach seems to be the potential numerical instability. Depending on the spectral properties of $A$, the set $r_j, \ldots, A^{s-1} r_j$ may tend to converge to a vector in the direction of a dominating eigenvector, or, in other words, may tend to dependence for increasing values of $s$. The authors claim to have seen successful completion of this approach, with no serious stability problems, for small values of $s$. Nevertheless, it seems that $s$-step CG, because of these problems, has a bad reputation (see also [69]). However, a similar approach, suggested by Chronopoulos and Kim [12] for other processes such as GMRES, seems to be more promising. Several authors have pursued this research direction, and we will come back to this in section 7.3.
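The tendency of the vectors $r_j, A r_j, \ldots, A^{s-1} r_j$ towards linear dependence can be made visible with a small experiment. The sketch below is only an illustration: the 1D Laplacian test matrix, the random start vector, and the function name `krylov_block_condition` are assumptions. It measures the condition number of the block of normalised vectors for increasing $s$.

```python
import numpy as np


def krylov_block_condition(A, r, s):
    """Condition number of the column-normalised block [r, Ar, ..., A^{s-1} r]."""
    V = np.empty((r.shape[0], s))
    v = r / np.linalg.norm(r)
    for k in range(s):
        V[:, k] = v
        v = A @ v
        v = v / np.linalg.norm(v)         # keep only the direction of A^k r
    return np.linalg.cond(V)


n = 100
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian
r = np.random.default_rng(2).standard_normal(n)

block_sizes = (2, 4, 8, 12)
conditions = [krylov_block_condition(A, r, s) for s in block_sizes]
```

The condition number cannot decrease when columns are added, and for matrices like this one it grows quickly with $s$, which is one way to see why only small values of $s$ are reported to work without stability problems.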

We consider still another variant of CG, in which it is possible to overlap all of the communication time with useful computations. This variant is just a reorganized version of the original CG scheme, and is therefore precisely as stable. The key trick in this approach is to delay the updating of the solution vector by one iteration step.

Another advantage over the previous scheme is that no additional operations are required.

It is assumed that the preconditioner $K$ can be written as $K = L L^T$. Furthermore, it is assumed that the preconditioner has a block structure, corresponding to the gridblocks assigned to the processors, so that communication (if necessary) can be overlapped with computation.

    x_{-1} = x_0 = initial guess; r_0 = b - A x_0;
    p_{-1} = 0; beta_{-1} = 0; alpha_{-1} = 0;
    s = L^{-1} r_0;
    rho_0 = (s, s);
    for i = 0, 1, 2, ....
        w_i = L^{-T} s;                            (0)
        p_i = w_i + beta_{i-1} p_{i-1};            (1)
        q_i = A p_i;                               (2)
        gamma = (p_i, q_i);                        (3)
        x_i = x_{i-1} + alpha_{i-1} p_{i-1};       (4)
        alpha_i = rho_i / gamma;                   (5)
        r_{i+1} = r_i - alpha_i q_i;               (6)
        s = L^{-1} r_{i+1};                        (7)
        rho_{i+1} = (s, s);                        (8)
        if r_{i+1} small enough then               (9)
            x_{i+1} = x_i + alpha_i p_i
            quit;
        beta_i = rho_{i+1} / rho_i;
    end i;

Now we discuss how this scheme may lead to an efficient parallel scheme, and how local memory (vector registers, cache, ...) can be exploited.

1. All computing intensive operations can be carried out in parallel. Only for the operations (2), (3), (7), (8), (9), and (0), communication between processors is required. We have assumed that the communication in (2), (7), and (0) can be largely overlapped with computation.

2. The communication required for the assembly of the inner product in (3) can be overlapped with the update for $x$ (which could have been done in the previous iteration step).

3. The assembly of the inner product in (8) can be overlapped with the computation in (0). Also step (9) usually requires information such as the norm of the residual, which can be overlapped with (0).

4. Steps (1), (2), and (3) can be combined: the computation of a segment of $p_i$ can be followed immediately by the computation of a segment of $q_i$ in (2), and this can be followed by the computation of a part of the inner product in (3). This saves on load operations for segments of $p_i$ and $q_i$.

5. Depending on the structure of $L$, the computation of segments of $r_{i+1}$ in (6) can be followed by operations in (7), which can be followed by the computation of parts of the inner product in (8), and the computation of the norm of $r_{i+1}$, required for (9).

6. The computation of $\beta_i$ can be done as soon as the computation in (8) has been completed. At that moment, the computation for (1) can be started if the requested parts of $w_i$ have been completed in (0).

7. If no preconditioner is used, then $w_i = r_i$, and steps (7) and (0) have to be skipped. Step (8) has to be replaced by $\rho_{i+1} = (r_{i+1}, r_{i+1})$. Now we need useful computation in order to overlap the communication for this inner product. To this end, one might split the computation in (4) per processor in two parts. The first of these parts are computed in parallel in overlap with (3), while the parallel computation of the other parts is used in order to overlap the communication for the computation of $\rho_{i+1}$.
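The reorganized scheme can be checked serially before any overlap is implemented. The following is a minimal sketch, assuming $K = LL^T$ is the block diagonal part of a 2D Laplacian (a block Jacobi choice made here for illustration), a dense Cholesky factor, and a relative residual test in step (9); the function names are hypothetical. The step numbers in the comments refer to the scheme above.

```python
import numpy as np


def laplacian_2d(m):
    """5-point Laplacian on an m-by-m grid (block tridiagonal, order m*m)."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))


def cg_delayed_update(A, b, L, tol=1e-10, maxiter=500):
    """Preconditioned CG with K = L L^T and the x-update delayed by one step."""
    x_prev = np.zeros_like(b)             # x_{-1} = x_0 = 0
    p_prev = np.zeros_like(b)
    alpha_prev, beta_prev = 0.0, 0.0
    r = b - A @ x_prev
    s = np.linalg.solve(L, r)             # a real code would use a triangular solve
    rho = s @ s
    for i in range(maxiter):
        w = np.linalg.solve(L.T, s)       # (0)
        p = w + beta_prev * p_prev        # (1)
        q = A @ p                         # (2)
        gamma = p @ q                     # (3) reduction, can overlap with (4)
        x = x_prev + alpha_prev * p_prev  # (4) delayed update of the solution
        alpha = rho / gamma               # (5)
        r = r - alpha * q                 # (6)
        s = np.linalg.solve(L, r)         # (7)
        rho_next = s @ s                  # (8) reduction, can overlap with (0)
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):   # (9)
            return x + alpha * p, i + 1   # catch up with the postponed update
        beta_prev = rho_next / rho
        rho = rho_next
        x_prev, p_prev, alpha_prev = x, p, alpha
    return x_prev + alpha_prev * p_prev, maxiter


m = 8
A = laplacian_2d(m)
b = np.ones(A.shape[0])
T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
K = np.kron(np.eye(m), T + 2.0 * np.eye(m))   # diagonal blocks of A (block Jacobi)
L = np.linalg.cholesky(K)
x, iterations = cg_delayed_update(A, b, L)
```

Because only the order of the operations changes, the iterates are the same as those of the standard preconditioned scheme in exact arithmetic; the benefit appears only when the reductions in (3) and (8) are issued as non-blocking communication.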


5.4 Parallel performance of CG:

Some realistic 3D computational fluid dynamics simulation problems, as well as other problems, lead to the necessity to solve linear systems $Ax = b$ with a matrix of very large order, billions of unknowns, say. If not of very special structure, such systems are not likely to be solved by direct elimination methods.

For such very large (sparse) systems we will have to exploit parallelism in combination with suitable solution techniques, like for instance iterative solution methods.

From a parallel point of view CG mimics very well the parallel performance properties of a variety of iterative methods such as Bi-CG, CGS, BiCGSTAB, QMR, and others.

In this section we study the performance of CG on parallel distributed memory systems and we report on some supporting experiments on actual existing machines. Guided by our experiments we will discuss the suitability of CG for Massively Parallel Processing systems.

All computationally intensive elements in preconditioned CG (updates, inner products, and matrix vector operations) are trivially parallelizable for shared memory machines [23], except possibly for the preconditioning step: Solve $w_{i+1}$ from $K w_{i+1} = r_{i+1}$. For the latter operation parallelism depends very much on the choice for $K$. In this section we restrict ourselves to block Jacobi preconditioning, where the blocks have been chosen so that each processor can handle one block independently of the others. For other preconditioners that allow some degree of parallelism see [23].

For a distributed memory machine at least some of the steps require communication between processors: the accumulation of inner products and the computation of $A p_i$ (depending on the non-zero structure of $A$ and the distribution of the non-zero elements over the processors). We consider in some more detail the situation where $A$ is a block-tridiagonal matrix of order $N$, and we assume that all blocks are of order $\sqrt{N}$:

$$A = \begin{pmatrix}
A_1 & D_1 & & \\
D_1 & A_2 & D_2 & \\
 & D_2 & \ddots & \ddots \\
 & & \ddots & \ddots
\end{pmatrix},$$

in which the $D_i$ are diagonal matrices, and the $A_i$ are tridiagonal matrices. Such systems occur quite frequently in finite difference approximations in 2 space dimensions. Our discussion can easily be adapted to 3 space dimensions.

5.4.1 Processor configuration and data distribution

For simplicity we will assume that the processors are connected as a 2D grid with $p \times p = P$ processors. The data have been distributed in a straightforward manner over the processor memories and we have not attempted to fully exploit the underlying grid structure for the given type of matrix in order to reduce communication as much as possible. In fact it will turn out that in our case the communication for the matrix vector product plays only a minor role for matrix systems of large size.

Fig. 1: Distribution of A over the processors.

Because of symmetry only the 3 non-zero diagonals in the upper triangular part of $A$ need to be stored, and we have chosen to store successive parts of length $N/P$ of each diagonal in consecutive neighbouring processors. In Figure 1 we see which part of $A$ is represented by the data in the memory of a given processor.

The blocks for block Jacobi are chosen to be the diagonal blocks that are available on each processor, and the various vectors ($r_i$, $p_i$, etc.) have been distributed likewise, i.e. each processor holds a section of length $N/P$ of these vectors in its local memory.

5.4.2  Required  Communication 

Matrix vector product. It is easily seen for a 2D processor grid (as well as for many other configurations, including hypercube and pipeline), that the matrix vector product can be completed with only neighbour-neighbour communication. This means that the communication costs do not increase for increasing values of $p$. If one follows a domain decomposition approach, in which the finite difference discretization grid is subdivided into $p$ by $p$ subgrids ($p$ in $x$-direction and $p$ in $y$-direction), then the communication costs are smaller than the computational costs by a factor of $O(\sqrt{N}/p)$.

In [17] much attention is given to this sparse matrix vector product and it is shown that the time for communication can almost completely be overlapped with computational work. Therefore, with adequate coding the matrix vector products do not necessarily lead to serious communication problems, not even for relatively small-sized problems.

On  a  MEIKO  SPl  (located  at  Utrecht  University,  this 
machine  has  only  4  processors)  we  have  observed,  for 
N  =  90000,  a  speed-up  by  a  factor  of  1.85  for  two 
processors,  and  of  1.96  when  overlap  possibilities  are 
exploited.  In  both  cases  we  expect,  by  extrapolat¬ 
ing  our  timing  results,  a  factor  of  2  for  very  large  N. 
According  to  a  naive  interpretation  of  Amdahl’s  law 
we  might  expect  a  severe  degradation  in  performance 
for  more  than  two  processors.  However,  if  we  in¬ 
crease  the  size  of  the  problem  for  increasing  numbers 
of  processors  then  the  local  communication  time  for 
the  matrix  product  does  not  increase  so  that  it  does 
not  pose  limits  on  the  performance  when  we  increase 
the  value  of  p. 

vector update: In our case these operations do not require any
communication and we should expect linear speed-up when
increasing the number of processors $P$.

inner product: For the innerproduct we need global communication
for assembly and we need global communication for the
distribution of the assembled innerproduct over the processors.
For a $p \times p$ processor grid these communication costs are
proportional to $p$. This means that for a constant length of
the vector parts per processor, these communication costs will
dominate for values of $p$ large enough. This is quite unlike
the situation for the matrix vector product and, as we will see,
it may be a severely limiting factor in achieving high speed-ups
in a massively parallel environment.

For the MEIKO SPl we have done some experiments in order to
determine the costs of interprocessor communication and for
communication. Assuming that the costs for communication (for
the innerproducts) grow linearly with the length of the path of
communication we have modelled the wall-clock time for 1
iteration with CG, for matrices of order $90000P$, as in Figure 2.
Note that we have increased the size of the linear system
linearly with the number of processors, which seems realistic
since with larger computers one aims to solve larger systems.
The value 90000 has been chosen since this is more or less the
size of the part of the system that can be kept in the local
memory of one processor of our MEIKO machine.

Fig. 2: Modelled timings for 1 iteration with CG.

From this Figure we learn that for $P$ slightly larger than 400
the communication costs may be expected to dominate, and
eventually they will lead to very low speed-ups (even for
systems for which the size is as large as the total memory
permits). We also see that if overlap of communication and
computation is possible, then potentially the communication can
be hidden for values of $P$ less than 400, but this demands a
reformulation of the CG algorithm. Of course, these expectations
are based on a model, but we have also carried out similar
experiments on the 512 processor Parsytec GCel-3/512 of the
University of Amsterdam [15]. In particular we have observed on
that machine that the communication time for the innerproduct
increases like $\sqrt{P}$, which just explains
the behaviour of our model for the MEIKO-type of
architecture. 
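
The shape of such a model can be illustrated with a minimal Python sketch. It assumes, as in the text, a fixed amount of computation per processor (the system grows with $P$) and inner-product communication that grows like $\sqrt{P}$. The function names and the three timing constants are hypothetical placeholders chosen for illustration; they are not the measured MEIKO values.

```python
import math

def modelled_iteration_time(P, t_comp=1.0, t_local=0.02, t_hop=0.05):
    """Hypothetical wall-clock model for one CG iteration on P processors.

    t_comp  : computation time per iteration (constant, since N grows with P)
    t_local : neighbour-neighbour communication for the matrix vector product
    t_hop   : inner-product communication cost per unit of path length
    All constants are illustrative assumptions, not measurements.
    """
    path_length = math.sqrt(P)            # p for a p-by-p processor grid
    t_inner = 2.0 * t_hop * path_length   # two inner products per iteration
    return t_comp + t_local + t_inner

def first_P_where_inner_products_dominate(t_comp=1.0, t_hop=0.05, P_max=10**6):
    """Smallest P for which inner-product communication exceeds computation."""
    for P in range(1, P_max + 1):
        if 2.0 * t_hop * math.sqrt(P) > t_comp:
            return P
    return None
```

With other constants the crossover moves, but the $\sqrt{P}$ growth of the inner-product term eventually dominates the constant computational term in any case.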

Our experiments and our modelling approach clearly show that
even a method like CG, which might be anticipated to be highly
parallel, may suffer severely from the communication overhead
due to the required innerproducts. Our study indicates that if
we want reasonable speed-up in a massively parallel environment
then the local memories should also be much larger when the
number of processors is increased, in order to accommodate
systems large enough to compensate for the increased global
communication costs.

Another  approach  would  be  to  modify  the  CG 
method  such  that  the  innerproducts  take  relatively 
less time. Many such approaches have been studied
recently. A quite popular approach is to reformulate
CG such that the required innerproducts can
be  computed  simultaneously,  so  that  the  communi¬ 
cation  overhead  is  reduced  (the  communication  re¬ 
quired  for  2  simultaneous  innerproducts  is  almost  the 
same  as  for  1  innerproduct).  An  extreme  form  of  this 
approach  is  to  reformulate  CG  so  that  a  number  of 
basis  vectors  for  the  search  space  are  computed  with¬ 
out  making  them  orthogonal.  The  orthogonalization 
is  then  carried  out  afterwards,  and  in  this  approach 
most  of  the  communication  can  be  combined.  The 
numerical  stability  of  these  approaches  is  still  a  point 
of  concern.  For  an  overview  and  further  references 
see  [6].  For  some  other  iterative  methods,  such  as 
GMRES,  this  approach  can  be  quite  effective  as  is 
shown  in  [17]. 
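
The remark that two simultaneous innerproducts cost almost the same communication as one can be made concrete with a small Python sketch. It is an assumption-laden illustration: the "processors" are simulated by slices of a NumPy array, a global reduction is simulated by summing the per-processor contributions, and the function names are hypothetical.

```python
import numpy as np

def separate_reductions(r_parts, w_parts):
    """Two inner products computed with two global reductions."""
    rho = sum(float(r @ r) for r in r_parts)                     # reduction 1
    mu = sum(float(r @ w) for r, w in zip(r_parts, w_parts))     # reduction 2
    return rho, mu, 2   # last entry: number of global reductions

def combined_reduction(r_parts, w_parts):
    """The same two inner products packed into one global reduction."""
    # Each processor forms a length-2 message from its local vector parts.
    local = [np.array([r @ r, r @ w]) for r, w in zip(r_parts, w_parts)]
    total = np.sum(local, axis=0)   # one reduction of length-2 messages
    return float(total[0]), float(total[1]), 1
```

The combined version sends slightly longer messages but needs only one synchronization point, which is the effect the reformulated CG variants aim for.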

Still another approach is to try to do more useful
computational work per iteration step, so that the
communication for the two innerproducts takes relatively
less  time.  One  way  to  do  this  is  to  use  polynomial 
preconditioning,  i.e.,  the  preconditioner  consists  of 
a  number  of  matrix  vector  products  with  the  matrix 
A.  This  may  work  well  in  situations  where  the  matrix 
vector  product  requires  only  little  (local)  communica¬ 
tion.  Another  way  is  to  apply  domain  decomposition: 
the  given  domain  is  split  into  P,  say,  subdomains  with 
estimated  values  for  the  solutions  on  the  interfaces. 
Then  all  the  subproblems  are  solved  independently 
and in parallel. This way of approximating the solution
may be viewed as a preconditioning step in an
iterative method. In this way we do more computational
work per communication step. Unfortunately,
depending on the problem and on the way of decoupling
the subdomains one may need a larger number
of iteration steps for larger values of $P$, which may
then, of course, degrade the overall efficiency of the
domain decomposition approach. For more information
on this approach we also refer to references given
in [6].

If  a  given  architecture  permits  the  overlap  of  com¬ 
munication  with  computation,  then  we  may  try  to  re¬ 
formulate  CG  in  order  to  create  possibilities  for  over¬ 
lap.  For  the  (extrapolated)  MEIKO  this  may  help  for 
values  of  P  up  to  about  400.  For  larger  P  we  will  see 
communication  dominating  anyhow,  but  the  adverse 
effects  can  be  lessened.  A  stable  reformulation  of  CG 
which  has  this  effect  has  been  described  in  [20]. 

6  Unsymmetric  problems 

There  are  essentially  three  different  ways  to  solve 
unsymmetric  linear  systems,  while  maintaining  some 
kind  of  orthogonality  between  the  residuals: 

1.  Solve the normal equations $A^T A x = A^T b$ with
conjugate gradients

2.  Make  all  the  residuals  explicitly  orthogonal  in 
order  to  have  an  orthogonal  basis  for  the  Krylov 
subspace 

3.  Construct  a  basis  for  the  Krylov  subspace  by  a 
3-term  biorthogonality  relation 

6.1  Normal  Equations: 

The first solution seems rather obvious. However, it
has severe disadvantages because of the squaring of
the condition number. As a result the solution
is more susceptible to errors in the right-hand
side, and the rate of convergence of the CG procedure
is much slower than for a comparable symmetric
system with a matrix with the same condition number
as $A$. Moreover, the amount of work per iteration
step, necessary for the matrix vector product, is
doubled.

Several proposals have been made to improve the numerical
stability of this rather robust approach. The best known is by
Paige and Saunders [61] and is based upon applying the Lanczos
method to the auxiliary system

$\begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} r \\ x \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix}$.

Clever execution of this delivers in fact the factors $L$ and $U$
of the $LU$-decomposition of the tridiagonal matrix that would
have been delivered when carrying out the Lanczos procedure with
$A^T A$.

Another approach to improve the numerical stability of this
normal equations approach is suggested by Björck and Elfving [8].
They observed that the matrix $A^T A$ is used in the construction
of the iteration coefficients through an innerproduct like
$(p, A^T A p)$. They simply suggest replacing such an innerproduct
by $(Ap, Ap)$.

The use of conjugate gradients in a least squares context, as
well as a theoretical comparison with SIRT type methods, is
discussed in [81] and [82].

An interesting variant of LSQR is the so-called Craig's method
[61]. The easiest way to think of this method is to apply
Conjugate Gradients to the system $A^T A x = A^T b$, with the
following choice for the innerproduct

$[x, y] \equiv (x, (A^T A)^{-1} y)$,

which defines a proper innerproduct if $A$ is of full rank (see
section 5.1.1).

First note that the two innerproducts in CG (as in section
5.1.1) can be computed without inverting $A^T A$:

$[p_i, A^T A p_i] = (p_i, p_i)$,

and, assuming that $b \in R(A)$ so that $Ax = b$ has a unique
solution $x$ (since $A$ has full rank):

$[r_i, r_i] = [A^T(Ax_i - b), A^T(Ax_i - b)]$

$= [A^T A(x_i - x), A^T(Ax_i - b)]$

$= (x_i - x, A^T(Ax_i - b))$

(6.1a)  $= (Ax_i - b, Ax_i - b)$.

Apparently, with CG we are minimizing

$[x_i - x, A^T A(x_i - x)] = (x_i - x, x_i - x)$

(6.1b)  $= \|x_i - x\|_2^2$,

that is, in this approach the Euclidean norm of the error is
minimized. Note, however, that the rate of convergence of
Craig's method is determined by the condition number of
$A^T A$, so that this method is only attractive if one has a
good preconditioner for $A^T A$.

6.2  FOM  and  GMRES:

The second approach is to form explicitly an orthonormal basis
for the Krylov subspace. Since $A$ is not symmetric we no longer
have a 3-term recurrence relation for that purpose and the new
basis vector has to be made explicitly orthonormal with respect
to all the previous vectors:

$h_{i+1,i} v_{i+1} = A v_i - \sum_{j=1}^{i} h_{j,i} v_j$.

As in the symmetric case this can be exploited in two different
ways. The orthogonality relation can either be written as

(6.2a)  $A V_i = V_i H_i + h_{i+1,i} v_{i+1} e_i^T$,

after which the projected system, with a Hessenberg matrix
instead of a tridiagonal matrix as in the symmetric case, can be
solved (nonsymmetric CG, GENCG, FOM, Arnoldi's method), or it
can be written as

(6.2b)  $A V_i = V_{i+1} \bar{H}_i$,

after which the projected system, with an $(i+1)$ by $i$ upper
Hessenberg matrix, can be solved as a least squares system. In
GMRES [72] this is done by the QR method using Givens rotations
in order to annihilate the subdiagonal elements in the upper
Hessenberg matrix $\bar{H}_i$.

The first approach (based upon (6.2a)) is similar to
the conjugate gradient approach (or SYMMLQ), the
second approach (based upon (6.2b)) is similar to the
conjugate  directions  method  (or  MINRES). 
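
The relation $A V_i = V_{i+1} \bar{H}_i$ can be checked numerically with a minimal Python sketch of the explicit orthogonalization by modified Gram-Schmidt. The function name `arnoldi_mgs` and the use of dense NumPy arrays are assumptions made for illustration, and the sketch does not handle break-down ($h_{i+1,i} = 0$).

```python
import numpy as np

def arnoldi_mgs(A, r0, m):
    """Build V (n by m+1) and the (m+1) by m Hessenberg matrix H
    such that A @ V[:, :m] == V @ H, using modified Gram-Schmidt."""
    n = r0.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = r0 / np.linalg.norm(r0)
    for i in range(m):
        w = A @ V[:, i]
        # Orthogonalize the new vector against all previous basis vectors.
        for k in range(i + 1):
            H[k, i] = w @ V[:, k]
            w = w - H[k, i] * V[:, k]
        H[i + 1, i] = np.linalg.norm(w)   # assumed nonzero (no break-down)
        V[:, i + 1] = w / H[i + 1, i]
    return V, H
```

The square upper part `H[:m, :m]` is the matrix used in the FOM-type approach (6.2a); the full rectangular `H` is the one used in the least squares approach (6.2b).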

In  order  to  avoid  excessive  storage  requirements 
and  computational  costs  for  the  orthogonalization, 
GMRES  is  usually  restarted  after  each  m  iteration 
steps.  This  algorithm  is  referred  to  as  GMRES(m). 
Below  we  give  a  scheme  for  GMRES  (m)  which  may 
be  suitable  to  develop  a  computer  code.  It  solves 
Ax  =  6,  with  a  given  preconditioner  K . 

$x_0$ is an initial guess;
for $j = 1, 2, \ldots$
    Solve $r$ from $Kr = b - Ax_0$;
    $v_1 = r/\|r\|_2$;
    $s := \|r\|_2 e_1$;
    for $i = 1, 2, \ldots, m$
        Solve $w$ from $Kw = Av_i$;
        orthogonalization of $w$ against the $v$'s,
        by modified Gram-Schmidt process:
        for $k = 1, \ldots, i$
            $h_{k,i} = (w, v_k)$;
            $w = w - h_{k,i} v_k$;
        end $k$;
        $h_{i+1,i} = \|w\|_2$;
        $v_{i+1} = w/h_{i+1,i}$;
        apply $J_1, \ldots, J_{i-1}$ on $(h_{1,i}, \ldots, h_{i+1,i})$;
        construct $J_i$, acting on the $i$-th and $(i+1)$-st
        component of $h_{\cdot,i}$, such that the $(i+1)$-st
        component of $J_i h_{\cdot,i}$ is 0;
        $s := J_i s$;
        if $s(i+1)$ is small enough then:
            (UPDATE($\tilde{x}$, $i$); quit);
    end $i$;
    UPDATE($\tilde{x}$, $m$);
end $j$;



In this scheme UPDATE($\tilde{x}$, $i$) replaces the following
computations:

    Compute $y$ as the solution of $Hy = \tilde{s}$,
    in which the upper $i$ by $i$ triangular
    part of $H$ has $h_{i,j}$ as its elements
    (in least squares sense if $H$ is singular),
    $\tilde{s}$ represents the first $i$ components of $s$;
    $\tilde{x} = x_0 + y_1 v_1 + y_2 v_2 + \ldots + y_i v_i$;
    $s_{i+1}$ equals $\|b - A\tilde{x}\|_2$;
    if this component is not small enough
        then $x_0 = \tilde{x}$;
        else quit;
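
The scheme above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a production code: the name `gmres_m`, the dense NumPy storage, the callable `K_solve` (which applies $K^{-1}$ and defaults to the identity), and the absolute stopping test on the preconditioned residual norm are all choices made here. Break-down of the Givens rotation (both entries zero) is not handled.

```python
import numpy as np

def gmres_m(A, b, m=20, tol=1e-10, max_restarts=50, K_solve=None, x0=None):
    """Restarted GMRES(m) with modified Gram-Schmidt and Givens rotations."""
    n = b.shape[0]
    if K_solve is None:
        K_solve = lambda v: v                  # no preconditioning
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    for _ in range(max_restarts):
        r = K_solve(b - A @ x)
        beta = np.linalg.norm(r)
        if beta < tol:
            return x
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        cs, sn = np.zeros(m), np.zeros(m)
        s = np.zeros(m + 1)
        s[0] = beta
        V[:, 0] = r / beta
        steps = m
        for i in range(m):
            w = K_solve(A @ V[:, i])
            for k in range(i + 1):             # modified Gram-Schmidt
                H[k, i] = w @ V[:, k]
                w = w - H[k, i] * V[:, k]
            H[i + 1, i] = np.linalg.norm(w)
            if H[i + 1, i] > 0.0:
                V[:, i + 1] = w / H[i + 1, i]
            for k in range(i):                 # apply J_1, ..., J_{i-1}
                hk, hk1 = H[k, i], H[k + 1, i]
                H[k, i] = cs[k] * hk + sn[k] * hk1
                H[k + 1, i] = -sn[k] * hk + cs[k] * hk1
            d = np.hypot(H[i, i], H[i + 1, i])  # construct J_i
            cs[i], sn[i] = H[i, i] / d, H[i + 1, i] / d
            H[i, i], H[i + 1, i] = d, 0.0
            s[i + 1] = -sn[i] * s[i]           # s := J_i s
            s[i] = cs[i] * s[i]
            if abs(s[i + 1]) < tol:
                steps = i + 1
                break
        # UPDATE: solve the upper triangular system and correct x.
        y = np.linalg.solve(H[:steps, :steps], s[:steps])
        x = x + V[:, :steps] @ y
        if abs(s[steps]) < tol:
            return x
    return x
```

The residual norm is available as $|s_{i+1}|$ in every step without forming the approximate solution, which is why the update is postponed until convergence or restart.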

Another  scheme  for  GMRES,  based  upon  House¬ 
holder  orthogonalization  instead  of  modified  Gram- 
Schmidt  has  been  proposed  in  [92].  For  certain  ap¬ 
plications  it  seems  attractive  to  invest  in  additional 
computational  work  in  turn  for  improved  numerical 
properties:  the  better  orthogonality  might  save  iter¬ 
ation  steps. 

The eigenvalues of $H_i$ are the Ritz values of $K^{-1}A$
with respect to the Krylov subspace spanned by $v_1$,
..., $v_i$. They approximate eigenvalues of $K^{-1}A$
increasingly well for increasing dimension $i$.

There is an interesting and simple relation between the two
different Krylov subspace projection approaches (6.2a), the
"FOM" approach, and (6.2b), the "GMRES" approach. The projected
system matrix $\bar{H}_i$ is transformed by Givens rotations to
an upper triangular matrix (with last row equal to zero). So, in
fact, the major difference between FOM and GMRES is that in FOM
the last ($(i+1)$-th) row is simply discarded, while in GMRES
this row is rotated to a zero vector. Let us characterize the
Givens rotation, acting on rows $i$ and $i+1$, in order to zero
the element in position $(i+1, i)$, by the sine $s_i$ and the
cosine $c_i$. Let us further denote the residuals for FOM with a
superscript $F$ and those for GMRES with superscript $G$. Then
the above observations lead to the following results for FOM and
GMRES (for details see [72] and [9]).

1.  The reduction for successive GMRES residuals is given by

(6.2c)  $\|r_k^G\|_2 = |s_k| \, \|r_{k-1}^G\|_2$

([72]: p. 862, Proposition 1)

2.  If $c_k \neq 0$ then the FOM and the GMRES residuals are
related by

(6.2d)  $\|r_k^G\|_2 = |c_k| \, \|r_k^F\|_2$

([9]: theorem 5.1)

From these relations we see that when GMRES has a local
significant reduction in the norm of the residual (i.e., $s_k$
is small), then FOM gives about the same result as GMRES (since
$c_k^2 = 1 - s_k^2$). On the other hand when FOM has a
break-down ($c_k = 0$), then GMRES does not lead to an
improvement in the same iteration step.

Because  of  these  relations  we  can  link  the  conver¬ 
gence  behaviour  of  GMRES  with  the  convergence  of 
Ritz  values  (the  eigenvalues  of  the  ”FOM”  part  of  the 
upper  Hessenberg  matrix).  This  has  been  exploited 
in  [88],  for  the  analysis  and  explanation  of  local  ef¬ 
fects  in  the  convergence  behaviour  of  GMRES. 

In  order  to  limit  the  required  amount  of  memory 
storage  and  the  amount  of  flops  per  iteration  step, 
one  often  restarts  the  GMRES  method  after  each  m 
steps.  This  restarted  version  is  commonly  referred  to 
as  GMRES(m),  while  the  not-restarted  method  often 
is  called  Full  GMRES. 

There  are  various  different  implementations  of 
FOM  and  GMRES.  Among  those  equivalent  with 
GMRES  are:  Orthomin  [91],  Orthodir  [44],  Axels- 
son’s  method  [3]  and  GENCR  [27].  These  methods 
are  often  more  expensive  than  GMRES  per  itera¬ 
tion  step.  Orthomin  seems  to  be  still  popular,  since 
this  variant  can  be  easily  truncated  (Orthomin(s)), 
in  contrast  to  GMRES.  The  truncated  or  restarted 
versions  of  these  algorithms  are  not  necessarily  math¬ 
ematically  equivalent. 

Methods  that  are  mathematically  equivalent  with 
FOM  are:  Orthores  [44]  and  GENCG  [13,  93].  In 
these  methods  the  approximate  solutions  are  con¬ 
structed  such  that  they  lead  to  orthogonal  residuals 
(which  form  a  basis  for  the  Krylov  subspace;  analo¬ 
gously  to  the  CG  method).  A  good  overview  of  all 
these  methods  and  their  relations  is  given  in  [71]. 

6.3  Rank-one  updates  for  the  Matrix  Splitting:

Iterative  methods  can  be  derived  from  a  splitting  of 
the matrix, and we have used the very simple splitting
$A = I - R$, with $R = I - A$, in order to derive the
projection type methods. In [26] it is suggested to update
the matrix splitting with information obtained
in  the  iteration  process.  We  will  give  the  flavour  of 
this  method  here  since  it  turns  out  that  it  has  an  in¬ 
teresting  relation  with  GMRES.  This  relation  is  ex¬ 
ploited  in  [89]  for  the  construction  of  new  classes  of 
GMRES-like  methods,  that  can  be  used  as  cheap  al¬ 
ternatives  for  the  increasingly  expensive  full  GMRES 
method. 

Assume that the matrix splitting in the $k$-th iteration step is
given by $A = H_k^{-1} - R_k$, then we obtain the iteration
formula

$x_k = x_{k-1} + H_k r_{k-1}$ with $r_k = b - A x_k$.

The idea is now to construct $H_k$ by a suitable rank-one update
to $H_{k-1}$:

$H_k = H_{k-1} + u_{k-1} v_{k-1}^T$,

which leads to

(6.3a)  $x_k = x_{k-1} + (H_{k-1} + u_{k-1} v_{k-1}^T) r_{k-1}$

or

$r_k = r_{k-1} - A(H_{k-1} + u_{k-1} v_{k-1}^T) r_{k-1}$

(6.3b)  $= (I - A H_{k-1}) r_{k-1} - A u_{k-1} v_{k-1}^T r_{k-1}$

$= (I - A H_{k-1}) r_{k-1} - \mu_{k-1} A u_{k-1}$.

The optimal choice for the update would have been to select
$u_{k-1}$ such that

$\mu_{k-1} A u_{k-1} = (I - A H_{k-1}) r_{k-1}$,

or

$\mu_{k-1} u_{k-1} = A^{-1} (I - A H_{k-1}) r_{k-1}$.

However, $A^{-1}$ is unknown and the best approximation we have
for it is $H_{k-1}$. This leads to the choice

(6.3c)  $u_{k-1} = H_{k-1} (I - A H_{k-1}) r_{k-1}$.

The constant $\mu_{k-1}$ is chosen such that $\|r_k\|_2$ is
minimal as a function of $\mu_{k-1}$. This leads to

$\mu_{k-1} = \frac{1}{\|A u_{k-1}\|_2^2} (A u_{k-1})^T (I - A H_{k-1}) r_{k-1}$.

Since $v_{k-1}$ has to be chosen such that
$\mu_{k-1} = v_{k-1}^T r_{k-1}$, we have the following obvious
choice for it

(6.3d)  $v_{k-1} = \frac{1}{\|A u_{k-1}\|_2^2} (I - A H_{k-1})^T A u_{k-1}$

(note that from the minimization property we have that
$r_k \perp A u_{k-1}$).

In principle the implementation of the method is
quite straightforward, but note that the computation
of $r_{k-1}$, $u_{k-1}$ and $v_{k-1}$ costs 4 matrix vector
multiplications with $A$ (and also some with $H_{k-1}$). This
would  make  the  method  too  expensive  for  being  of 
practical  interest.  Also  the  updated  splitting  is  most 
likely  a  dense  matrix  if  we  carry  out  the  updates  ex¬ 
plicitly. 

We  will  now  show,  still  following  the  lines  set  forth  in 
[26],  that  there  are  orthogonality  properties,  follow¬ 
ing  from  the  minimization  step,  by  which  the  method 
can  be  implemented  much  more  efficiently. 

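We define

1.  $c_k = A u_k / \|A u_k\|_2$

2.  $E_k = I - A H_k$

From (6.3b) we have that $r_k = E_k r_{k-1}$, and from (6.3c):

$A u_k = A H_k E_k r_k = \alpha_k c_k$

or

(6.3e)  $c_k = \frac{1}{\alpha_k} (I - E_k) E_k r_k = \frac{1}{\alpha_k} E_k (I - E_k) r_k$.

Furthermore (on behalf of (6.3d)):

$E_k = I - A H_k$

$= I - A H_{k-1} - A u_{k-1} v_{k-1}^T$

$= I - A H_{k-1} - \frac{1}{\|A u_{k-1}\|_2^2} A u_{k-1} (A u_{k-1})^T (I - A H_{k-1})$

$= (I - c_{k-1} c_{k-1}^T) E_{k-1}$

(6.3f)  $= \prod_{i=0}^{k-1} (I - c_i c_i^T) E_0$.

We see that the operator $E_k$ has the following effect on a
vector. The vector is multiplied by $E_0$ and then orthogonalized
with respect to $c_0$, ..., $c_{k-1}$. Now we have from (6.3e)
that

$c_k = \frac{1}{\alpha_k} E_k y_k$, with $y_k = (I - E_k) r_k$,

and hence

(6.3g)  $c_k \perp c_0, \ldots, c_{k-1}$.

A consequence from (6.3g) is that

$\prod_{j=0}^{k-1} (I - c_j c_j^T) = I - \sum_{j=0}^{k-1} c_j c_j^T = I - P_{k-1}$,

and therefore

(6.3h)  $P_k = \sum_{j=0}^{k} c_j c_j^T$.

The actual implementation is based on the above properties.
Given $r_k$ we compute $r_{k+1}$ as follows (and we update $x_k$
in the corresponding way):

$r_{k+1} = E_{k+1} r_k$.

With $\xi^{(0)} = E_0 r_k$ we first compute (with the $c_j$ from
previous steps):

$E_k r_k = \xi^{(k)} \equiv (I - \sum_{j=0}^{k-1} c_j c_j^T) \xi^{(0)} = \prod_{j=0}^{k-1} (I - c_j c_j^T) \xi^{(0)}$.

The expression with $\sum$ leads to a Gram-Schmidt formulation,
the expression with $\prod$ leads to the Modified Gram-Schmidt
variant.

The computed updates for $r_{k+1}$ correspond to updates for
$x_{k+1}$: since $A u_j = c_j$ after normalization, each update
$-(c_j^T \xi^{(j)}) c_j$ of the residual is matched by an update
$(c_j^T \xi^{(j)}) u_j$ of the approximate solution. These
updates are in the scheme, given below, represented by $\eta$.

From (6.3c) we know that

$u_k = H_k E_k r_k = H_k \xi^{(k)}$.

Now we have to make $A u_k \sim c_k$ orthogonal w.r.t. $c_0$,
..., $c_{k-1}$, and to update $u_k$ accordingly. Once we have
done that we can do the final update step to make $H_{k+1}$, and
we can update both $x_k$ and $r_k$ by the corrections following
from including $c_k$. The orthogonalization step can be carried
out easily as follows. With $c_k^{(0)} = A H_0 \xi^{(k)}$, define

$c_k^{(k)} \equiv \alpha_k c_k = A H_k E_k r_k = (I - E_k) E_k r_k$  (see (6.3e))

$= (I - E_0 + P_{k-1} E_0) \xi^{(k)}$  (see (6.3f))

$= A H_0 \xi^{(k)} + P_{k-1} (I - A H_0) \xi^{(k)}$

$= c_k^{(0)} + P_{k-1} \xi^{(k)} - P_{k-1} c_k^{(0)}$.

Note that the second term vanishes since
$\xi^{(k)} \perp c_0, \ldots, c_{k-1}$.

The resulting scheme for the $k$-th iteration step becomes:

1.  $\xi^{(0)} = (I - A H_0) r_k$;  $\eta^{(0)} = H_0 r_k$;
    for $i = 0, \ldots, k-1$ do
        $\alpha_i = c_i^T \xi^{(i)}$;
        $\xi^{(i+1)} = \xi^{(i)} - \alpha_i c_i$;  $\eta^{(i+1)} = \eta^{(i)} + \alpha_i u_i$;

2.  $u_k^{(0)} = H_0 \xi^{(k)}$;  $c_k^{(0)} = A u_k^{(0)}$;
    for $i = 0, \ldots, k-1$ do
        $\beta_i = -c_i^T c_k^{(i)}$;  $c_k^{(i+1)} = c_k^{(i)} + \beta_i c_i$;
        $u_k^{(i+1)} = u_k^{(i)} + \beta_i u_i$;
    $c_k = c_k^{(k)} / \|c_k^{(k)}\|_2$;  $u_k = u_k^{(k)} / \|c_k^{(k)}\|_2$;

3.  $x_{k+1} = x_k + \eta^{(k)} + u_k c_k^T \xi^{(k)}$;
    $r_{k+1} = (I - c_k c_k^T) \xi^{(k)}$;

Remarks
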

1.  The  above  scheme  is  a  Modified  Gram  Schmidt 
variant,  given  in  [89],  of  the  original  scheme  in 
[26]. 

2.  If we keep $H_0$ fixed, i.e., $H_0 = I$, then the
method is not scaling invariant (the results for
$\rho Ax = \rho b$ depend on $\rho$). In [89] a scaling
invariant method is suggested.

3.  Note that in the above implementation we have
'only' two matrix vector products per iteration
step. In [89] it is shown that in many cases we
may also expect convergence comparable to that of
GMRES in half the number of iteration steps.

4.  A different choice for $u_{k-1}$ does not change the
formulas for $v_{k-1}$ and $E_{k-1}$. For each different
choice we can derive similar schemes as the one
above.

5.  From (6.3b) we have

$r_k = r_{k-1} - A H_{k-1} r_{k-1} - \mu_{k-1} A u_{k-1}$.

In view of the previous remark we might also make the different
choice $u_{k-1} = H_{k-1} r_{k-1}$. With this choice, we obtain a
variant which is algebraically identical to GMRES (for a proof
of this see [89]). This GMRES variant is obtained by the
following changes in the previous scheme: Take $H_0 = 0$ (note
that in this case we have that $E_{k-1} r_{k-1} = r_{k-1}$, and
hence we may skip part 1 of the above algorithm), and set
$\xi^{(k)} = r_k$, $\eta^{(k)} = 0$. In step 2 start with
$u_k^{(0)} = \xi^{(k)}$. The result is a different formulation of
GMRES in which we can obtain explicit formulas for the updated
preconditioner (i.e., the inverse of $A$ is approximated
increasingly well): The update for $H_k$ is $u_k c_k^T E_k$ and
the sum of these updates gives an approximation for $A^{-1}$.

6.  Also in this GMRES-variant we are still free to select $u_k$
a little bit differently. Remember that the leading factor
$H_{k-1}$ in (6.3c) was introduced as an approximation for the
actually desired $A^{-1}$. With $u_{k-1} = A^{-1} r_{k-1}$, we
would have that $r_k = E_{k-1} r_{k-1} - \mu_{k-1} r_{k-1} = 0$
for the minimizing $\mu_{k-1}$. We could take other
approximations for the inverse (with respect to the given
residual $r_{k-1}$), e.g., the result vector $y$ obtained by a
few steps of GMRES applied to $Ay = r_{k-1}$. This leads to the
so-called GMRESR family of nested methods (for details see
[89]). See also section 6.4. A similar algorithm, named FGMRES,
has been derived independently by Saad [70]. In FGMRES the
search directions are preconditioned, whereas in GMRESR the
residuals are preconditioned. This gives GMRESR direct control
over the reduction in norm of the residual. As a result GMRESR
can be made robust while FGMRES may suffer from break-down. A
further disadvantage of the FGMRES formulation is that this
method cannot be truncated, or selectively orthogonalized, as
GMRESR can.

In  [4]  a  generalized  conjugate  gradient  method  is 
proposed,  a  variant  of  which  produces  in  exact 
arithmetic  identical  results  as  the  proper  variant 
of  GMRESR,  though  at  higher  computational 
costs  and  with  a  classical  Gram-Schmidt  orthog- 
onalization  process  instead  of  the  modified  pro¬ 
cess  as  in  GMRESR. 


6.4  GMRESR  and  GMRES*: 

By  Van  der  Vorst  and  Vuik  [89]  it  has  been  shown 
how  the  GMRES-method  can  be  combined  (or  rather 
preconditioned)  with  other  iterative  schemes.  The  it¬ 
eration  steps  of  GMRES  (or  GCR)  are  called  outer 
iteration  steps,  while  the  iteration  steps  of  the  pre¬ 
conditioning  iterative  method  are  referred  to  as  inner 
iterations.  The  combined  method  is  called  GMRES*, 
where  *  stands  for  any  given  iterative  scheme;  in  the 
case  of  GMRES  as  the  inner  iteration  method,  the 
combined  scheme  is  called  GMRESR[89]. 

Similar  schemes  have  been  proposed  recently.  In 
FGMRES[70]  the  update  directions  for  the  ap¬ 
proximate  solution  are  preconditioned,  whereas  in 
GMRES*  the  residuals  are  preconditioned.  The  lat¬ 
ter  approach  offers  more  control  over  the  reduction  in 
the  residual,  in  particular  break-down  situations  can 
be  easily  detected  and  remedied. 

In  exact  arithmetic  GMRES*  is  very  close  to  the  Gen¬ 
eralized  Conjugate  Gradient  method[4];  GMRES*, 
however,  leads  to  a  more  efficient  computational 
scheme. 

The  GMRES*  algorithm  can  be  described  by  the 
following  computational  scheme: 

$x_0$ is an initial guess; $r_0 = b - Ax_0$;
for $i = 0, 1, 2, 3, \ldots$
    Let $z^{(m)}$ be the approximate solution
    of $Az = r_i$, obtained after $m$ steps of
    an iterative method.
    $c = Az^{(m)}$ (often available from the
    iteration method)
    for $k = 0, \ldots, i-1$
        $\alpha = (c_k, c)$
        $c = c - \alpha c_k$
        $z^{(m)} = z^{(m)} - \alpha u_k$
    $c_i = c/\|c\|_2$;  $u_i = z^{(m)}/\|c\|_2$
    $x_{i+1} = x_i + (c_i, r_i) u_i$
    $r_{i+1} = r_i - (c_i, r_i) c_i$
    if $x_{i+1}$ is accurate enough then quit
end

A sufficient condition to avoid break-down in this
method ($\|c\|_2 = 0$) is that the norm of the residual
at the end of an inner iteration is smaller than the norm of
the right-hand side residual: $\|Az^{(m)} - r_i\|_2 < \|r_i\|_2$.
This can easily be controlled during the inner iteration process.
If stagnation occurs, i.e. no progress at all is made
in the inner iteration, then it is suggested by Van der
Vorst and Vuik [89] to do one (or more) steps of the
LSQR method, which guarantees a reduction (but
this  reduction  is  often  only  small). 
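
The GMRES* scheme can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the names `gmres_star` and `minres_steps` are hypothetical, dense NumPy arrays are assumed, and the inner method used here (a few steps of a one-dimensional minimal residual iteration) is only a simple stand-in for "any given iterative scheme". No LSQR fallback for stagnation is included.

```python
import numpy as np

def minres_steps(A, r, m=5):
    """Stand-in inner solver: m minimal residual steps for A z = r."""
    z = np.zeros_like(r)
    res = r.copy()
    for _ in range(m):
        Ares = A @ res
        denom = Ares @ Ares
        if denom == 0.0:
            break
        omega = (Ares @ res) / denom   # minimizes ||res - omega * A res||_2
        z = z + omega * res
        res = res - omega * Ares
    return z

def gmres_star(A, b, inner=minres_steps, tol=1e-10, max_outer=200, x0=None):
    """Outer GCR-type iteration with an arbitrary inner iterative method."""
    x = np.zeros_like(b, dtype=float) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x
    C, U = [], []                      # stored vectors c_k and u_k
    for _ in range(max_outer):
        if np.linalg.norm(r) < tol:
            break
        z = inner(A, r)                # approximate solution of A z = r_i
        c = A @ z
        for ck, uk in zip(C, U):       # orthogonalize c, update z accordingly
            alpha = ck @ c
            c = c - alpha * ck
            z = z - alpha * uk
        norm_c = np.linalg.norm(c)
        if norm_c == 0.0:
            raise RuntimeError("break-down: ||c||_2 = 0")
        ci, ui = c / norm_c, z / norm_c
        gamma = ci @ r
        x = x + gamma * ui
        r = r - gamma * ci
        C.append(ci)
        U.append(ui)
    return x, r
```

Because $A u_i = c_i$ holds by construction, the residual can be updated without an extra matrix vector product, and its norm cannot increase in an outer step.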

The idea behind this combined iteration scheme is that we
explore parts of high-dimensional Krylov subspaces, hopefully
localizing the same approximate solution that full GMRES would
find over the entire subspace, but now at much lower
computational costs. The alternatives for the inner iteration
could be either one cycle of GMRES(m), since then we have also
locally an optimal method, or some other iteration scheme, like
for instance Bi-CGSTAB. As has been shown by Van der Vorst [87]
there are a number of different situations where we may expect
stagnation or slow convergence for GMRES(m). In such cases it
does not seem wise to use this method.

On the other hand it may also seem questionable whether a method
like Bi-CGSTAB should lead to success in the inner iteration.
This method does not satisfy a useful global minimization
property and a large part of its effectiveness comes from the
underlying Bi-CG algorithm, which is based on bi-orthogonality
relations. This means that for each outer iteration the inner
iteration process has to build a bi-orthogonality relation
again. It has been shown for the related Conjugate Gradients
method that the orthogonality relations are determined largely
by the distribution of the weights at the lower end of the
spectrum and by the isolated eigenvalues at the upper end of the
spectrum [82]. By the nature of this kind of Krylov process the
largest eigenvalues and their corresponding eigenvector
components quickly enter the process after each restart. Hence
it may be expected that much of the work is lost in
rediscovering the same eigenvector components in the error over
and over again, whereas these components may already be so small
that further reduction in those directions in the outer
iteration is a waste of time, since it hardly contributes to a
smaller norm of the residual.

This heuristic way of reasoning may explain in part our rather
disappointing experiences with Bi-CGSTAB as the inner iteration
process for GMRES*.

De Sturler and Fokkema [19] propose to prevent the outer search
directions explicitly from being reinvestigated in the inner
process. This is done by keeping the Krylov subspace that is
built in the inner iteration orthogonal with respect to the
Krylov basis vectors generated in the outer iteration. The
procedure works as follows.

In the outer iteration process the vectors $c_0$, ..., $c_{i-1}$
build an orthogonal basis for the Krylov subspace. Let $C_i$ be
the $n$ by $i$ matrix with columns $c_0$, ..., $c_{i-1}$. Then
the inner iteration process at outer iteration $i$ is carried
out with the operator $A_i$ instead of $A$, and $A_i$ is defined
as

(6.4a)  $A_i = (I - C_i C_i^T) A$.

It is easily verified that $A_i z \perp c_0, \ldots, c_{i-1}$
for all $z$, so that the inner iteration process takes place in
a subspace orthogonal to these vectors. The additional costs,
per iteration of the inner iteration process, are $i$ inner
products and $i$ vector updates. In order to
save on these costs, one should realize that it is not
necessary to orthogonalize with respect to all previous
c-vectors, and that "less effective" directions may
be dropped, or combined with others. In De Sturler and
Fokkema [19] suggestions are made for such strategies.
Of course, these strategies are appropriate in cases where
we see too little residual reducing effect in the inner
iteration process in comparison with the outer iterations of
GMRES*. 
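The orthogonality property of the projected operator can be checked numerically. The following is a minimal sketch, assuming that the columns of $C_i$ are orthonormal (so that $I - C_i C_i^T$ is an orthogonal projector); the function names and the random test data are illustrative and not taken from [19].

```python
import numpy as np

def projected_operator(A, C):
    """Explicit A_i = (I - C C^T) A, for C with orthonormal columns."""
    n = A.shape[0]
    return (np.eye(n) - C @ C.T) @ A

def apply_projected(A, C, z):
    """Matrix-free product A_i z: one product with A, then
    i inner products and i vector updates (one per column of C)."""
    y = A @ z
    for c in C.T:                 # iterate over the columns c_0, ..., c_{i-1}
        y = y - (c @ y) * c
    return y

rng = np.random.default_rng(0)
n, i = 8, 3
A = rng.standard_normal((n, n))
C, _ = np.linalg.qr(rng.standard_normal((n, i)))   # orthonormal columns
A_i = projected_operator(A, C)
z = rng.standard_normal(n)

# Largest component of A_i z along any of the c-vectors (should be ~0).
max_overlap = np.abs(C.T @ (A_i @ z)).max()
```

The matrix-free variant makes the extra cost per inner iteration visible: the loop performs exactly $i$ inner products and $i$ vector updates.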

6.5  Bi-conjugate  Gradients: 

The third class of methods arises from the attempt
to construct a suitable set of basis vectors for the
Krylov subspace by a three-term recurrence relation
as in (5.0a):

(6.5a)  $\alpha_{j+1} r_{j+1} = A r_j - \beta_j r_j - \gamma_j r_{j-1}$.

As we have seen in the proof for the orthogonality of
such a set of vectors (see section 4), we needed the
symmetry of the matrix $A$. In the nonsymmetric case
we need instead of (5.0b) that

$(A r_{j-1}, r_k) = (r_{j-1}, A^T r_k) = 0$ for $k < j - 2$.

By similar arguments as in the proof for (5.0a) we
conclude that (6.5a) can be used to generate a basis
$r_0, \ldots, r_{i-1}$ for $K^i(A; r_0)$, such that $r_j \perp K^j(A^T; r_0)$,

or even more general,

$r_j \perp K^j(A^T; s_0)$,

since there is no explicit need to generate the Krylov
subspace for $A^T$ with $r_0$ as the starting vector.

If we let the basis vectors $s_j$ for $K^i(A^T; s_0)$ satisfy the
same recurrence relation as the vectors $r_j$, i.e., with
identical recurrence coefficients, then we see that

$(r_k, s_j) = 0$ for $k \neq j$

(by a simple symmetry argument).

Hence, the sets $\{r_j\}$ and $\{s_j\}$ satisfy a bi-orthogonality
relation. Now we can proceed in a similar way as in
the symmetric case:

(6.5b)  $A R_i = R_i T_i + \alpha_i r_i e_i^T$,

but now we use the matrix $S_i = [s_0, s_1, \ldots, s_{i-1}]$ for
the projection of the system

$S_i^T (A x_i - b) = 0$,

or

$S_i^T A R_i y - S_i^T b = 0$.

Using (6.5b) we find that $y$ satisfies

$S_i^T R_i T_i y = (r_0, s_0) e_1$.

Since $S_i^T R_i$ is a diagonal matrix with diagonal elements
$(r_j, s_j)$, we find, if all these diagonal elements
are nonzero, that

$T_i y = e_1 \Rightarrow x_i = R_i y$.
This method is known as the Bi-Lanczos method [47].
We see that we are in trouble when a diagonal element
of $S_i^T R_i$ becomes (near) zero; this is referred
to in the literature as a serious (near) breakdown. The
way to get around this difficulty is the so-called look-ahead
strategy, which comes down to taking a number
of successive basis vectors for the Krylov subspace
together and making them blockwise bi-orthogonal.
This has been worked out in detail in [62] and [30],
[31], [32].

Another  way  to  avoid  break-down  is  to  restart  as 
soon  as  a  diagonal  element  gets  small.  Of  course,  this 
strategy  looks  surprisingly  simple,  but  one  should  re¬ 
alise  that  at  a  restart  the  Krylov  subspace,  that  has 
been  built  up  so  far,  is  thrown  away,  which  destroys 
possibilities  for  faster  (i.e.,  superlinear)  convergence. 

As has been shown for Conjugate Gradients, the
LU decomposition of the tridiagonal system can be
updated from iteration to iteration and this leads to
a recursive update of the solution vector. This avoids
having to save all intermediate $r$ and $s$ vectors. This
variant of Bi-Lanczos is usually called Bi-Conjugate
Gradients, or Bi-CG for short [28].

Of course one can in general not be certain that an
LU decomposition (without pivoting) of the tridiagonal
matrix $T_i$ exists, and this may lead also to breakdown
of the Bi-CG algorithm. Note that this breakdown
can be avoided in the Bi-Lanczos formulation
of this iterative solution scheme. It is also avoided in
the QMR approach (see Section 5.4.2).

Note that for symmetric matrices Bi-Lanczos generates
the same solution as Lanczos, provided that
$s_0 = r_0$, and under the same condition Bi-CG delivers
the same iterands as CG for positive definite
matrices. However, the bi-orthogonal variants do so
at the cost of two matrix vector operations per iteration
step.

It is difficult to make a fair comparison between
GMRES and Bi-CG. GMRES really minimizes a
residual, but at the cost of increasing work for keeping
all residuals orthogonal and increasing demands
for memory space. Bi-CG does not minimize a residual,
but often its convergence is comparably fast to that
of GMRES, at the cost of twice the number of matrix
vector products per iteration step. However, the generation
of the basis vectors is relatively cheap and the
memory requirements are limited and modest. Several
variants of Bi-CG have been proposed which increase
the effectiveness of this class of methods in certain
circumstances. These variants will be discussed
in coming subsections.

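The following scheme may be used for a computer
implementation of the Bi-CG method. In the scheme
the equation $Ax = b$ is solved with a suitable preconditioner $K$.

    $x_0$ is an initial guess; $r_0 = b - A x_0$;
    solve $w_0$ from $K w_0 = r_0$;
    $\tilde r_0$ is an arbitrary vector such that $(w_0, \tilde r_0) \neq 0$,
        usually one chooses $\tilde r_0 = r_0$ or $\tilde r_0 = w_0$;
    solve $\tilde w_0$ from $K^T \tilde w_0 = \tilde r_0$;
    $p_{-1} = \tilde p_{-1} = 0$; $\beta_{-1} = 0$; $\rho_0 = (w_0, \tilde r_0)$;
    for $i = 0, 1, 2, \ldots$
        $p_i = w_i + \beta_{i-1} p_{i-1}$;
        $\tilde p_i = \tilde w_i + \beta_{i-1} \tilde p_{i-1}$;
        $z_i = A p_i$;
        $\alpha_i = \rho_i / (\tilde p_i, z_i)$;
        $r_{i+1} = r_i - \alpha_i z_i$;
        $\tilde r_{i+1} = \tilde r_i - \alpha_i A^T \tilde p_i$;
        solve $w_{i+1}$ from $K w_{i+1} = r_{i+1}$;
        solve $\tilde w_{i+1}$ from $K^T \tilde w_{i+1} = \tilde r_{i+1}$;
        $\rho_{i+1} = (w_{i+1}, \tilde r_{i+1})$;
        $x_{i+1} = x_i + \alpha_i p_i$;
        if $x_{i+1}$ is accurate enough then quit;
        $\beta_i = \rho_{i+1} / \rho_i$;
    end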

As with conjugate gradients, the coefficients $\alpha_j$ and
$\beta_j$, $j = 0, \ldots, i-1$ build the matrix $T_i$, as given in
formula (5.1b). This matrix is, for BiCG, in general
not similar to a symmetric matrix. Its eigenvalues can
be viewed as Petrov-Galerkin approximations, with
respect to the spaces $\{\tilde r_j\}$ and $\{r_j\}$, of eigenvalues of
$A$. For increasing values of $i$ they tend to converge to
eigenvalues of $A$. The convergence patterns, however,
may be much more complicated and irregular than in
the symmetric case.
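The preconditioned Bi-CG scheme given above can be sketched in a few lines of NumPy. This is an illustration only, under the following assumptions: dense matrices, the choice $\tilde r_0 = r_0$, a stopping test on the relative norm of the updated residual, and a Jacobi (diagonal) preconditioner and random test matrix that are chosen here for the example and do not come from the text. No safeguard against breakdown is included.

```python
import numpy as np

def bicg(A, b, K=None, x0=None, tol=1e-10, maxit=500):
    """Preconditioned Bi-CG sketch; returns (x, number of iterations)."""
    n = b.size
    K = np.eye(n) if K is None else K
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x
    rt = r.copy()                        # shadow residual, choice r~_0 = r_0
    w = np.linalg.solve(K, r)            # K w = r
    wt = np.linalg.solve(K.T, rt)        # K^T w~ = r~
    p = np.zeros(n)
    pt = np.zeros(n)
    beta = 0.0
    rho = w @ rt
    bnorm = np.linalg.norm(b)
    for it in range(1, maxit + 1):
        p = w + beta * p
        pt = wt + beta * pt
        z = A @ p
        alpha = rho / (pt @ z)           # breaks down if (p~, z) = 0
        x = x + alpha * p
        r = r - alpha * z
        rt = rt - alpha * (A.T @ pt)     # second matrix-vector product
        if np.linalg.norm(r) <= tol * bnorm:
            return x, it
        w = np.linalg.solve(K, r)
        wt = np.linalg.solve(K.T, rt)
        rho_new = w @ rt
        beta = rho_new / rho
        rho = rho_new
    return x, maxit

# Hypothetical test problem: mildly nonsymmetric, well conditioned.
rng = np.random.default_rng(0)
n = 30
A = np.diag(np.linspace(2.0, 5.0, n)) + 0.1 * rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)
K = np.diag(np.diag(A))                  # Jacobi preconditioner (assumption)
x, its = bicg(A, b, K)
```

Each pass through the loop uses one product with $A$ and one with $A^T$, which is the factor of two in matrix vector work mentioned above.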

6.5.1  Another  derivation  of  Bi-CG 

An alternative way to derive Bi-CG comes from considering
the following symmetric linear system:

$\begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} \hat x \\ x \end{pmatrix} = \begin{pmatrix} b \\ \hat b \end{pmatrix}$, or $B \tilde x = \tilde b$,

for some suitable vector $\hat b$.

If we select $\hat b = 0$ and apply the CG-scheme to this
system, then we obtain LSQR again. However, if
we select $\hat b \neq 0$ and apply the CG scheme with the
preconditioner

$K = \begin{pmatrix} 0 & I \\ I & 0 \end{pmatrix}$

in the way as is shown in section 4.4.1, then we obtain
right away the unpreconditioned Bi-CG scheme for
the system $Ax = b$. Note that the CG-scheme can be
applied since $K^{-1}B$ is symmetric (but not positive
definite) with respect to the bilinear form

$[p, q] = (p, Kq)$,

which is not a proper innerproduct. Hence, this formulation
clearly reveals the two principal weaknesses
of Bi-CG (i.e., the causes for break-down situations).
Note that if we restrict ourselves to vectors

$p = \begin{pmatrix} p_1 \\ p_1 \end{pmatrix}$,

then $[p, q]$ defines a proper innerproduct. This situation
arises for the Krylov subspace that is created
for $B$ and $\tilde b$ if $A = A^T$ and $\hat b = b$. If, in addition,
$A$ is positive definite then $K^{-1}B$ is positive definite
symmetric with respect to the generated Krylov subspace,
and we obtain the CG-scheme (as expected).
More generally, the choice

$K = \begin{pmatrix} 0 & K_1 \\ K_1^T & 0 \end{pmatrix}$,

where $K_1$ is a suitable preconditioner for $A$, leads to
the preconditioned version of the Bi-CG scheme, as
given in section 5.4.
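The two weaknesses mentioned above can be made concrete with a small numerical check. The sketch below assumes that the symmetric block system has $B = \begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}$ and that the preconditioner is the block permutation matrix $K = \begin{pmatrix} 0 & I \\ I & 0 \end{pmatrix}$; both are stated here as assumptions for the illustration, and the random matrix is hypothetical test data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
Z, I = np.zeros((n, n)), np.eye(n)

B = np.block([[Z, A], [A.T, Z]])     # symmetric block matrix (assumed form)
K = np.block([[Z, I], [I, Z]])       # block permutation preconditioner (assumed form)
M = np.linalg.solve(K, B)            # K^{-1} B = diag(A^T, A)

def form(p, q):
    """Bilinear form [p, q] = (p, K q)."""
    return p @ (K @ q)

# K^{-1} B is symmetric with respect to the bilinear form ...
p = rng.standard_normal(2 * n)
q = rng.standard_normal(2 * n)
sym_gap = abs(form(M @ p, q) - form(p, M @ q))

# ... but the form is indefinite, so it is not a proper inner product:
v = np.concatenate([np.ones(n), -np.ones(n)])
self_form = form(v, v)               # equals -2n, i.e. [v, v] < 0

# Restricted to vectors with two equal halves the form is positive:
u = np.concatenate([np.ones(n), np.ones(n)])
restricted_form = form(u, u)         # equals +2n
```

Because $[v, v]$ can vanish or be negative for nonzero $v$, the divisions in a CG iteration built on this form can fail, which matches the break-down situations of Bi-CG.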

The above presentation of Bi-CG was inspired by a
closely related presentation of Bi-CG in [42]. The latter
paper gives a rather untractable reference for the
choice of the system $B\tilde x = \tilde b$ and the preconditioner

$\begin{pmatrix} 0 & I \\ I & 0 \end{pmatrix}$

to [43].

6.5.2 QMR

The QMR method [32] relates to Bi-CG in a similar
way as MINRES relates to CG. For stability reasons
the basis vectors $r_j$ and $\tilde r_j$ are normalized (as
is usual in the underlying Bi-Lanczos algorithm, see
[94]), which leads to other coefficients in the 3-term
recursion formulas.

If we group the residual vectors $r_j$, for $j = 0, \ldots, i-1$
in a matrix $R_i$, then we can write the recurrence relations as

$A R_i = R_{i+1} \bar T_i$,

with $\bar T_i$ a tridiagonal matrix with $i+1$ rows and $i$ columns.

Similar as for MINRES we would like to construct
the $x_i$, with

$x_i \in \{r_0, A r_0, \ldots, A^{i-1} r_0\}$, $x_i = R_i y$,

for which

$\|A x_i - b\|_2 = \|A R_i y - b\|_2$
$= \|R_{i+1} \bar T_i y - b\|_2$
$= \|R_{i+1} D_{i+1}^{-1} \{ D_{i+1} \bar T_i y - \|r_0\|_2 e_1 \}\|_2$

is minimal. However, in this case that would be quite
an amount of work since the columns of $R_{i+1}$ are
not necessarily orthogonal. Freund and Nachtigal [32]
suggest to solve the minimum norm least squares
problem

(6.5c)  $\min_{y \in R^i} \| D_{i+1} \bar T_i y - \|r_0\|_2 e_1 \|_2$.

This leads to the simplest form of the QMR method.
A more general form arises if the least squares problem
(6.5c) is replaced by a weighted least squares
problem. No strategies are yet known for optimal
weights, however.

In [32] the QMR method is carried out on top of
a look-ahead variant of the bi-orthogonal Lanczos
method, which makes the method more robust. Experiments
suggest that QMR has a much smoother
convergence behaviour than Bi-CG, but it is not essentially
faster than Bi-CG.

6.5.3 CGS


For the bi-conjugate gradient residual vectors it
is well-known that they can be written as $r_j = P_j(A) r_0$
and $\tilde r_j = P_j(A^T) \tilde r_0$, and because of the
bi-orthogonality relation we have that

$(r_j, \tilde r_i) = (P_j(A) r_0, P_i(A^T) \tilde r_0)$
$= (P_i(A) P_j(A) r_0, \tilde r_0) = 0$,

for $i < j$.

The iteration parameters for bi-conjugate gradients
are computed from innerproducts like the above.
Sonneveld observed that we can also construct the
vectors $P_j^2(A) r_0$, using only the latter form of
the innerproduct for recovering the bi-conjugate gradients
parameters (which implicitly define the polynomial
$P_j$). By doing so, the shadow vectors $\tilde r_j$ need not
be formed, nor is there any multiplication with the
matrix $A^T$.

The resulting CGS [79] method works in general very
well for many unsymmetric linear problems. It often
converges much faster than Bi-CG (about twice as
fast in some cases) and does not have the disadvantage
of having to store extra vectors as in GMRES.
These three methods have been compared in many
studies (see, e.g., [67, 10, 65, 55]).

However,  CGS  usually  shows  a  very  irregular  con¬ 
vergence  behaviour.  This  behaviour  can  even  lead 
to  cancellation  and  a  spoiled  solution  [86].  See  also 
section  6.5.4. 
The following scheme carries out the CGS process
for the solution of $Ax = b$, with a given preconditioner $K$:

    $x_0$ is an initial guess; $r_0 = b - A x_0$;
    $\tilde r_0$ is an arbitrary vector, such that
        $(\tilde r_0, r_0) \neq 0$,
        e.g., $\tilde r_0 = r_0$; $\rho_0 = (\tilde r_0, r_0)$;
    $\beta_{-1} = \rho_0$; $p_{-1} = q_0 = 0$;
    for $i = 0, 1, 2, \ldots$
        $u_i = r_i + \beta_{i-1} q_i$;
        $p_i = u_i + \beta_{i-1}(q_i + \beta_{i-1} p_{i-1})$;
        solve $\hat p$ from $K \hat p = p_i$;
        $\hat v = A \hat p$;
        $\alpha_i = \rho_i / (\tilde r_0, \hat v)$;
        $q_{i+1} = u_i - \alpha_i \hat v$;
        solve $\hat u$ from $K \hat u = u_i + q_{i+1}$;
        $x_{i+1} = x_i + \alpha_i \hat u$;
        if $x_{i+1}$ is accurate enough then quit;
        $r_{i+1} = r_i - \alpha_i A \hat u$;
        $\rho_{i+1} = (\tilde r_0, r_{i+1})$;
        if $\rho_{i+1} = 0$ then method fails to converge!;
        $\beta_i = \rho_{i+1} / \rho_i$;
    end

In exact arithmetic, the $\alpha_j$ and $\beta_j$ are the same constants
as those generated by BiCG. Therefore, they
can be used to compute the Petrov-Galerkin approximations
for eigenvalues of $A$.
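A compact NumPy sketch of the CGS scheme follows. It is meant as an illustration, not as a production solver: dense matrices are assumed, $\tilde r_0 = r_0$ is used, the stopping test is on the relative norm of the updated residual, and the Jacobi preconditioner and random test matrix are assumptions made for this example.

```python
import numpy as np

def cgs(A, b, K=None, x0=None, tol=1e-10, maxit=500):
    """Preconditioned CGS sketch; returns (x, number of iterations)."""
    n = b.size
    K = np.eye(n) if K is None else K
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x
    rt = r.copy()                    # fixed shadow vector r~_0 = r_0
    rho = rt @ r
    beta = rho                       # beta_{-1}; irrelevant since p = q = 0
    p = np.zeros(n)
    q = np.zeros(n)
    bnorm = np.linalg.norm(b)
    for it in range(1, maxit + 1):
        u = r + beta * q
        p = u + beta * (q + beta * p)
        ph = np.linalg.solve(K, p)
        vh = A @ ph
        alpha = rho / (rt @ vh)
        q = u - alpha * vh
        uh = np.linalg.solve(K, u + q)
        x = x + alpha * uh
        r = r - alpha * (A @ uh)     # no product with A^T is needed
        if np.linalg.norm(r) <= tol * bnorm:
            return x, it
        rho_new = rt @ r
        if rho_new == 0.0:
            raise ArithmeticError("CGS breakdown: rho = 0")
        beta = rho_new / rho
        rho = rho_new
    return x, maxit

# Hypothetical test problem: mildly nonsymmetric, well conditioned.
rng = np.random.default_rng(0)
n = 30
A = np.diag(np.linspace(2.0, 5.0, n)) + 0.1 * rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)
x, its = cgs(A, b, K=np.diag(np.diag(A)))
```

Like Bi-CG, each iteration costs two matrix vector products, but both are with $A$.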

6.5.4  Effects  of  irregular  convergence 

By  very  irregular  convergence  we  refer  to  the  situa¬ 
tion  where  successive  residual  vectors  in  the  iterative 
process  differ  in  orders  of  magnitude  in  norm,  and 
some  of  these  residuals  may  be  even  much  bigger  in 
norm  than  the  starting  residual.  We  will  try  to  give 
an  impression  why  this  is  a  point  of  concern,  even 
if  eventually  the  (updated)  residual  satisfies  a  given 
tolerance.  For  more  details  we  refer  to  Sleijpen  et 
al[75,  77]. 

We  will  say  that  an  algorithm  is  accurate  for  a  cer¬ 
tain  problem  if  the  updated  residual  rj  and  the  true 
residual  b  —  Axj  are  of  comparable  size  for  the  j’s  of 
interest. 

The best we can hope for is that for each $j$ the error
in the residual is only the result of applying $A$ to the
update $w_{j+1}$ for $x_j$ in finite precision arithmetic:

(6.5d)  $r_{j+1} = r_j - A w_{j+1} - \Delta_A w_{j+1}$

if

(6.5e)  $x_{j+1} = x_j + w_{j+1}$,

for each $j$, where $\Delta_A$ is an $n \times n$ matrix for which
$|\Delta_A| \le n_A \bar\xi |A|$: $n_A$ is the maximum number of nonzero
matrix entries per row of $A$, $|B| = (|b_{ij}|)$ if
$B = (b_{ij})$, $\bar\xi$ is the relative machine precision, and the
inequality $\le$ refers to element-wise $\le$. In the Bi-CG
type methods that we consider, we compute explicitly
the update $A w_j$ for the residual $r_j$ from the update
$w_j$ for the approximation by matrix multiplication:
for this part, (6.5d) describes well the local deviations
caused by evaluation errors.

In the “ideal” case (i.e. situation (6.5d) whenever
we update the approximation) we have that

(6.5f)  $r_k - (b - A x_k) = -\sum_{j=1}^{k} \Delta_A w_j = -\sum_{j=1}^{k} \Delta_A (e_{j-1} - e_j)$,

where the perturbation matrix $\Delta_A$ may depend on $j$
and $e_j$ is the approximation error in the $j$th approximation:
$e_j = x - x_j$. Hence,

(6.5g)  $\bigl|\, \|r_k\| - \|b - A x_k\| \,\bigr| \le 2 n_A \bar\xi \,\| |A| \| \sum_{j=0}^{k} \|e_j\| \le 2 \Gamma \bar\xi \sum_{j=0}^{k} \|r_j\|$,

where $\Gamma = n_A \,\| |A| \| \,\|A^{-1}\|$.

Except for the factor $\Gamma$, the last upper-bound appears
to be rather sharp. We see that approximations
with large approximation errors may ultimately lead
to an inaccurate result. Such large local approximation
errors are typical for CGS, and in Van der Vorst [86]
an example of the resulting numerical inaccuracy
is given. If there are a number of approximations
with comparable large approximation errors,
then their multiplicity may replace the factor k, otherwise
it will be only the largest approximation error
that makes up virtually the bound for the deviation.

Example. Figure 3 illustrates nicely the loss of accuracy
as described above; for other examples, cf. [86].
The convergence history of the updated residuals (the
circles) and the true residuals (the solid curve)
of CGS is given for the matrix SHERMAN4 from
the Harwell-Boeing set of test matrices. Here, as in
other figures, the norm of the residuals, on log-scale,
is plotted (along the vertical axis) against the number
of matrix-vector multiplications (along the horizontal
axis). The dotted curve represents the
estimated inaccuracy: $2 \bar\xi \sum_{j<i} \|r_j\|$ (here with $\Gamma = 1$;
cf. (6.5g)).

We  will  discuss  two  approaches  that  lead  to  a 
smoother  convergence. 

- Approaches to obtain the smoothing effect by
adding a few lines to existing codes leave the speed of
convergence essentially unchanged. One of these approaches
leads to optimally accurate approximations
[76] and will be discussed in Section 7. For other
ones, we refer to the literature (e.g., [95]).

- In the next section, we concentrate on techniques
that really change the convergence; they smooth
down and speed up the convergence, and lead to more
accurate approximations, all at the same time.

Fig. 3: Convergence plot CGS for the true residuals
and the updated residuals.

6.5.5  Bi-CGSTAB 

Bi-CGSTAB [86] is based on the following observation.
Instead of squaring the Bi-CG polynomial, we
can construct other iteration methods, by which $x_i$
are generated so that $r_i = \tilde P_i(A) P_i(A) r_0$ with other
$i$-th degree polynomials $\tilde P_i$. An obvious possibility is
to take for $\tilde P_i$ a polynomial of the form

(6.5h)  $Q_i(x) = (1 - \omega_1 x)(1 - \omega_2 x) \cdots (1 - \omega_i x)$,

and to select suitable constants $\omega_j$. This expression
leads to an almost trivial recurrence relation for the
$Q_i$.

In Bi-CGSTAB $\omega_j$ in the $j$-th iteration step is chosen
as to minimize $r_j$, with respect to $\omega_j$, for residuals
that can be written as $r_j = Q_j(A) P_j(A) r_0$.

The preconditioned Bi-CGSTAB algorithm for solving
the linear system $Ax = b$, with preconditioning
$K$ reads as follows:

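    $x_0$ is an initial guess; $r_0 = b - A x_0$;
    $\tilde r_0$ is an arbitrary vector, such that
        $(\tilde r_0, r_0) \neq 0$, e.g., $\tilde r_0 = r_0$;
    $\rho_{-1} = \alpha_{-1} = \omega_{-1} = 1$;
    $v_{-1} = p_{-1} = 0$;
    for $i = 0, 1, 2, \ldots$
        $\rho_i = (\tilde r_0, r_i)$; $\beta_{i-1} = (\rho_i/\rho_{i-1})(\alpha_{i-1}/\omega_{i-1})$;
        $p_i = r_i + \beta_{i-1}(p_{i-1} - \omega_{i-1} v_{i-1})$;
        Solve $\hat p$ from $K \hat p = p_i$;
        $v_i = A \hat p$;
        $\alpha_i = \rho_i / (\tilde r_0, v_i)$;
        $s = r_i - \alpha_i v_i$;
        if $\|s\|$ small enough then
            $x_{i+1} = x_i + \alpha_i \hat p$; quit;
        Solve $z$ from $K z = s$;
        $t = A z$;
        $\omega_i = (t, s)/(t, t)$;
        $x_{i+1} = x_i + \alpha_i \hat p + \omega_i z$;
        if $x_{i+1}$ is accurate enough then quit;
        $r_{i+1} = s - \omega_i t$;
    end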

The matrix $K$ in this scheme represents the preconditioning
matrix and the way of preconditioning [86].
The above scheme in fact carries out the Bi-CGSTAB
procedure for the explicitly postconditioned linear
system

$A K^{-1} y = b$,

but the vectors $y_i$ and the residual have been back-transformed
to the vectors $x_i$ and $r_i$ corresponding to
the original system $Ax = b$. Compared to CGS two
extra innerproducts need to be calculated.

In exact arithmetic, the $\alpha_j$ and $\beta_j$ have the same values
as those generated by Bi-CG and CGS. Hence,
they can be used to extract eigenvalue approximations
for the eigenvalues of $A$ (see Bi-CG).

Bi-CGSTAB can be viewed as the product of Bi-CG
and GMRES(1). Of course, other product methods
can be formulated as well. Gutknecht [38] has proposed
BiCGSTAB2, which is constructed as the product
of Bi-CG and GMRES(2).
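The preconditioned Bi-CGSTAB scheme translates almost line by line into NumPy. The sketch below is illustrative only: it assumes dense matrices, the choice $\tilde r_0 = r_0$, a relative stopping test on the norm of $s$ and of the updated residual, and a Jacobi preconditioner and random test matrix that are assumptions of this example. Breakdown ($\omega_i = 0$ or $(\tilde r_0, v_i) = 0$) is not handled.

```python
import numpy as np

def bicgstab(A, b, K=None, x0=None, tol=1e-10, maxit=500):
    """Preconditioned Bi-CGSTAB sketch; returns (x, number of iterations)."""
    n = b.size
    K = np.eye(n) if K is None else K
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x
    rt = r.copy()                         # shadow vector r~_0 = r_0
    rho_old = alpha = omega = 1.0
    v = np.zeros(n)
    p = np.zeros(n)
    bnorm = np.linalg.norm(b)
    for it in range(1, maxit + 1):
        rho = rt @ r
        beta = (rho / rho_old) * (alpha / omega)
        p = r + beta * (p - omega * v)
        ph = np.linalg.solve(K, p)
        v = A @ ph
        alpha = rho / (rt @ v)
        s = r - alpha * v
        if np.linalg.norm(s) <= tol * bnorm:
            return x + alpha * ph, it     # early exit after the Bi-CG half step
        z = np.linalg.solve(K, s)
        t = A @ z
        omega = (t @ s) / (t @ t)         # minimizes ||s - omega * t||_2
        x = x + alpha * ph + omega * z
        r = s - omega * t
        if np.linalg.norm(r) <= tol * bnorm:
            return x, it
        rho_old = rho
    return x, maxit

# Hypothetical test problem: mildly nonsymmetric, well conditioned.
rng = np.random.default_rng(0)
n = 30
A = np.diag(np.linspace(2.0, 5.0, n)) + 0.1 * rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)
x, its = bicgstab(A, b, K=np.diag(np.diag(A)))
```

The two inner products in the line computing `omega` are the extra innerproducts compared to CGS.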

6.5.6  Derivation  of  Bi-CGSTAB 

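The polynomial $P_i$ and related polynomials are implicitly
defined by the Bi-CG scheme.

Bi-CG:

    $x_0$ is an initial guess; $r_0 = b - A x_0$;
    $\tilde r_0$ is an arbitrary vector, such that
        $(\tilde r_0, r_0) \neq 0$, e.g., $\tilde r_0 = r_0$;
    $\rho_0 = 1$;
    $\tilde p_0 = p_0 = 0$;
    for $i = 1, 2, 3, \ldots$
        $\rho_i = (\tilde r_{i-1}, r_{i-1})$; $\beta_i = (\rho_i/\rho_{i-1})$;
        $p_i = r_{i-1} + \beta_i p_{i-1}$;
        $\tilde p_i = \tilde r_{i-1} + \beta_i \tilde p_{i-1}$;
        $v_i = A p_i$;
        $\alpha_i = \rho_i / (\tilde p_i, v_i)$;
        $x_i = x_{i-1} + \alpha_i p_i$;
        if $x_i$ is accurate enough then quit;
        $r_i = r_{i-1} - \alpha_i v_i$;
        $\tilde r_i = \tilde r_{i-1} - \alpha_i A^T \tilde p_i$;
    end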

From this scheme it is straightforward to show that
$r_i = P_i(A) r_0$ and $p_{i+1} = T_i(A) r_0$, in which $P_i(A)$
and $T_i(A)$ are $i$-th degree polynomials in $A$. The Bi-CG
scheme then defines the relations between these
polynomials:

$T_i(A) r_0 = (P_i(A) + \beta_{i+1} T_{i-1}(A)) r_0$,

and

$P_i(A) r_0 = (P_{i-1}(A) - \alpha_i A T_{i-1}(A)) r_0$.

In the Bi-CGSTAB scheme we wish to have recurrence
relations for

$r_i = Q_i(A) P_i(A) r_0$.

With $Q_i$ as in (6.5h) and the Bi-CG relation for the
factor $P_i$ and $T_i$, it then follows that

$Q_i(A) P_i(A) r_0 = (1 - \omega_i A) Q_{i-1}(A) (P_{i-1}(A) - \alpha_i A T_{i-1}(A)) r_0$
$= \{Q_{i-1}(A) P_{i-1}(A) - \alpha_i A Q_{i-1}(A) T_{i-1}(A)\} r_0$
$\quad - \omega_i A \{Q_{i-1}(A) P_{i-1}(A) - \alpha_i A Q_{i-1}(A) T_{i-1}(A)\} r_0$.

Clearly, we also need a relation for the product
$Q_i(A) T_i(A) r_0$. This can also be obtained from the
Bi-CG relations:

$Q_i(A) T_i(A) r_0 = Q_i(A) (P_i(A) + \beta_{i+1} T_{i-1}(A)) r_0$
$= Q_i(A) P_i(A) r_0 + \beta_{i+1} (1 - \omega_i A) Q_{i-1}(A) T_{i-1}(A) r_0$
$= Q_i(A) P_i(A) r_0 + \beta_{i+1} Q_{i-1}(A) T_{i-1}(A) r_0$
$\quad - \beta_{i+1} \omega_i A Q_{i-1}(A) T_{i-1}(A) r_0$.

Finally we have to recover the Bi-CG constants
$\rho_i$, $\beta_i$, and $\alpha_i$ by innerproducts in terms of the new
vectors that we now have generated.

E.g., $\beta_i$ can be computed as follows. First we compute

$\tilde\rho_{i+1} = (\tilde r_0, Q_i(A) P_i(A) r_0) = (Q_i(A^T) \tilde r_0, P_i(A) r_0)$.

By construction $P_i(A) r_0$ is orthogonal with respect
to all vectors $U_{i-1}(A^T) \tilde r_0$, where $U_{i-1}$ is an arbitrary
polynomial of degree $i - 1$ at most. This means that
we have to consider only the highest order term of
$Q_i(A^T)$ when computing $\tilde\rho_{i+1}$. This term is given by
$(-1)^i \omega_1 \omega_2 \cdots \omega_i (A^T)^i$. We actually wish to compute

$\rho_{i+1} = (P_i(A^T) \tilde r_0, P_i(A) r_0)$,

and since the highest order term of $P_i(A^T)$ is given
by $(-1)^i \alpha_1 \alpha_2 \cdots \alpha_i (A^T)^i$ it follows that

$\beta_i = (\tilde\rho_i / \tilde\rho_{i-1})(\alpha_{i-1} / \omega_{i-1})$.

The other constants can be derived similarly.
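The index bookkeeping behind this relation can be written out explicitly. In the sketch below, $\tilde\rho_{i+1}$ denotes the innerproduct $(\tilde r_0, Q_i(A)P_i(A)r_0)$ that is available from the Bi-CGSTAB vectors, $\rho_{i+1}$ denotes the Bi-CG quantity $(P_i(A^T)\tilde r_0, P_i(A) r_0)$, and $c_i$ is a shorthand introduced here only for the common innerproduct; all $\omega_j$ are assumed to be nonzero.

```latex
\begin{aligned}
c_i &= \bigl((A^T)^i \tilde r_0,\; P_i(A) r_0\bigr),\\
\tilde\rho_{i+1} &= (-1)^i\,\omega_1 \omega_2 \cdots \omega_i \; c_i,
\qquad
\rho_{i+1} = (-1)^i\,\alpha_1 \alpha_2 \cdots \alpha_i \; c_i,\\
\rho_{i+1} &= \frac{\alpha_1 \alpha_2 \cdots \alpha_i}{\omega_1 \omega_2 \cdots \omega_i}\;\tilde\rho_{i+1},\\
\beta_{i+1} &= \frac{\rho_{i+1}}{\rho_i}
 = \frac{\tilde\rho_{i+1}}{\tilde\rho_i}\cdot\frac{\alpha_i}{\omega_i}.
\end{aligned}
```

Shifting the index by one gives the expression for $\beta_i$ in terms of $\tilde\rho_i$, $\tilde\rho_{i-1}$, $\alpha_{i-1}$ and $\omega_{i-1}$. The division by $\omega_i$ also shows why a vanishing $\omega_i$ leads to break-down.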

Note that in our discussion we have focussed on the
recurrence relations for the vectors $r_i$ and $p_i$, while in
fact our main goal is to determine $x_i$. As in all CG-type
methods, $x_i$ itself is not required for continuing
the iteration, but it can easily be determined as a
"sideproduct" by realizing that an update of the form
$r_i = r_{i-1} - \gamma A y$ corresponds to an update $x_i = x_{i-1} + \gamma y$
for the current approximated solution.

By writing $r_i$ for $Q_i(A) P_i(A) r_0$ and $p_i$ for
$Q_{i-1}(A) T_{i-1}(A) r_0$, we obtain the following scheme
for Bi-CGSTAB (we trust that, with the foregoing
observations, the reader will now be able to verify
the relations in Bi-CGSTAB). In this scheme we have
computed the $\omega_i$ so that $r_i = Q_i(A) P_i(A) r_0$ is minimized
in 2-norm as a function of $\omega_i$.

6.5.7  Bi-CGSTAB2  and  variants 

The residual $r_k = b - A x_k$ in the Bi-Conjugate Gradient
method, when applied to $Ax = b$ with start $x_0$,
can be written formally as $P_k(A) r_0$, where $P_k$ is a
$k$-degree polynomial. These residuals are constructed
with one operation with $A$ and one with $A^T$ per iteration
step. It was pointed out in [79] that with about
the same amount of computational effort one can construct
residuals of the form $\tilde r_k = P_k^2(A) r_0$, which is
the basis for the CGS method. This can be achieved
without any operation with $A^T$. The idea behind the
improved efficiency of CGS is that if $P_k(A)$ is viewed
as a reduction operator in BiCG, then one may hope
that the square of this operator will be a twice as
powerful reduction operator. Although this is not always
observed in practice, one typically has that CGS
converges faster than BiCG. This, together with the
absence of operations with $A^T$, explains the success
of the CGS method. A drawback of CGS is that its
convergence behavior can look quite erratic, that is,
the norms of the residuals converge quite irregularly,
and it may easily happen that $\|r_{k+1}\|_2$ is much larger
than $\|r_k\|_2$ for certain $k$ (for an explanation of this
see [84]).

In [86] it was shown that by a similar approach as
for CGS, one can construct methods for which $r_k$ can
be interpreted as $r_k = P_k(A) Q_k(A) r_0$, in which $P_k$ is
the polynomial associated with BiCG and $Q_k$ can be
selected free under the condition that $Q_k(0) = 1$. In
[86] it was suggested to construct $Q_k$ as the product
of $k$ linear factors $I - \omega_j A$, where $\omega_j$ was taken to
minimize locally a residual. This approach leads to
the BiCGSTAB method. Because of the local minimization,
BiCGSTAB displays a much smoother convergence
behavior than CGS, and more surprisingly
it often also converges (slightly) faster. One weak
point in BiCGSTAB is that we get break-down if an
$\omega_j$ is equal to zero. One may equally expect negative
effects when $\omega_j$ is small. In fact, BiCGSTAB can be
viewed as the combined effect of BiCG and GCR(1),
or GMRES(1), steps. As soon as the GCR(1) part of
the algorithm (nearly) stagnates, then the BiCG part
in the next iteration step cannot (or only poorly) be
constructed.
Another dubious aspect of BiCGSTAB is that the factor
$Q_k$ has only real roots by construction. It is well-known
that optimal reduction polynomials for matrices
with complex eigenvalues may have complex roots
as well. If, for instance, the matrix $A$ is real skew-symmetric,
then GCR(1) stagnates forever, whereas
a method like GCR(2) (or GMRES(2)), in which we
minimize over two combined successive search directions,
may lead to convergence, and this is mainly
due to the fact that then complex eigenvalue components
in the error can be effectively reduced.
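The stagnation for skew-symmetric matrices is easy to reproduce. The following small sketch uses a random skew-symmetric matrix and a random vector as hypothetical test data. It computes the GCR(1) / Bi-CGSTAB minimizer $\omega = (t, s)/(t, t)$ with $t = As$, and compares it with a minimization of the residual over the two directions $As$ and $A^2 s$, here done with a generic least squares solve rather than with any specific GCR(2) implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
A = M - M.T                      # real skew-symmetric: A^T = -A
s = rng.standard_normal(6)

# One-dimensional minimization of ||s - omega * A s||_2.
t = A @ s
omega = (t @ s) / (t @ t)        # (A s, s) = 0 for skew-symmetric A, so omega ~ 0
res1 = np.linalg.norm(s - omega * t)

# Two-dimensional minimization of ||s - c1 * A s - c2 * A^2 s||_2.
basis = np.column_stack([A @ s, A @ (A @ s)])
coef, *_ = np.linalg.lstsq(basis, s, rcond=None)
res2 = np.linalg.norm(s - basis @ coef)
```

With one direction the residual is not reduced at all, because $\omega$ vanishes up to rounding; with two directions the term in $A^2 s$ gives a genuine reduction, since $(A^2 s, s) = -\|As\|_2^2 \neq 0$.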

This  point  of  view  was  taken  in  [38]  for  the  con¬ 
struction  of  the  BiCGSTAB2  method.  In  the  odd- 
numbered  iteration  steps  the  Q-polynomial  is  ex¬ 
panded  by  a  linear  factor,  as  in  BiCGSTAB,  but 
in  the  even-numbered  steps  this  linear  factor  is  dis¬ 
carded,  and  the  Q-polynomial  from  the  previous 
even-numbered step is expanded by a quadratic
$1 - \alpha_k A - \beta_k A^2$. For this construction the information
from  the  odd-numbered  step  is  required.  It  was  antic¬ 
ipated  that  the  introduction  of  quadratic  factors  in  Q 
might  help  to  improve  convergence  for  systems  with 
complex  eigenvalues,  and,  indeed,  some  improvement 
was  observed  in  practical  situations  (see  also  [64]). 
However,  our  presentation  suggests  a  possible  weak¬ 
ness  in  the  construction  of  BiCGSTAB2,  namely 
in  the  odd-numbered  steps  the  same  problems  may 
occur  as  in  BiCGSTAB.  Since  the  even-numbered 
steps  rely  on  the  results  of  the  odd-numbered  steps, 
this  may  equally  lead  to  unnecessary  break-downs  or 
poor  convergence.  In  [78]  another,  and  even  simpler 
approach was taken to arrive at the desired even-numbered
steps, without the necessity of the construction
of the intermediate BiCGSTAB-type step
in  the  odd-numbered  steps.  Hence,  in  this  approach 
the  polynomial  Q  is  constructed  straight-away  as  a 
product  of  quadratic  factors,  without  ever  construct¬ 
ing  a  linear  factor.  As  a  result  the  new  method 
BiCGSTAB(2)  leads  only  to  significant  residuals  in 
the  even-numbered  steps  and  the  odd-numbered  steps 
do  not  lead  necessarily  to  useful  approximations. 

In fact, it is shown in [78] that the polynomial $Q$
can also be constructed as the product of $\ell$-degree
factors, without the construction of the intermediate
lower degree factors. The main idea is that $\ell$
successive BiCG steps are carried out, where for the
sake of an $A^T$-free construction the already available
part of $Q$ is expanded by simple powers of $A$. This
means that after the BiCG part of the algorithm
vectors from the Krylov subspace $s, As, A^2 s, \ldots, A^\ell s$,
with $s = P_k(A) Q_{k-\ell}(A) r_0$ are available, and it is then
relatively easy to minimize the residual over that particular
Krylov subspace. There are variants of this
approach in which more stable bases for the Krylov
subspaces are generated [77], but for low values of $\ell$
a standard basis suffices, together with a minimum
norm solution obtained through solving the associated
normal equations (which requires the solution
of an $\ell$ by $\ell$ system). In most cases BiCGSTAB(2)
will already give nice results for problems where BiCGSTAB
or BiCGSTAB2 may fail. Note, however,
that, in exact arithmetic, if no break-down situation
occurs, BiCGSTAB2 would produce exactly the same
results as BiCGSTAB(2) at the even-numbered steps.

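Bi-CGSTAB(2) can be represented by the following
algorithm:

    $x_0$ is an initial guess; $r_0 = b - A x_0$;
    $\tilde r_0$ is an arbitrary vector,
        such that $(r_0, \tilde r_0) \neq 0$,
        e.g., $\tilde r_0 = r_0$;
    $\rho_0 = 1$; $u = 0$; $\alpha = 0$; $\omega_2 = 1$;
    for $i = 0, 2, 4, 6, \ldots$
        $\rho_0 = -\omega_2 \rho_0$
        even BiCG step:
            $\rho_1 = (\tilde r_0, r_i)$; $\beta = \alpha \rho_1/\rho_0$; $\rho_0 = \rho_1$
            $u = r_i - \beta u$;
            $v = A u$
            $\gamma = (v, \tilde r_0)$; $\alpha = \rho_0/\gamma$;
            $r = r_i - \alpha v$;
            $s = A r$
            $x = x_i + \alpha u$;
        odd BiCG step:
            $\rho_1 = (\tilde r_0, s)$; $\beta = \alpha \rho_1/\rho_0$; $\rho_0 = \rho_1$
            $v = s - \beta v$;
            $w = A v$
            $\gamma = (w, \tilde r_0)$; $\alpha = \rho_0/\gamma$;
            $u = r - \beta u$
            $r = r - \alpha v$
            $s = s - \alpha w$
            $t = A s$
        GCR(2)-part:
            $\omega_1 = (r, s)$; $\mu = (s, s)$;
            $\nu = (s, t)$; $\tau = (t, t)$;
            $\omega_2 = (r, t)$; $\tau = \tau - \nu^2/\mu$;
            $\omega_2 = (\omega_2 - \nu \omega_1/\mu)/\tau$;
            $\omega_1 = (\omega_1 - \nu \omega_2)/\mu$
            $x_{i+2} = x + \omega_1 r + \omega_2 s + \alpha u$
            $r_{i+2} = r - \omega_1 s - \omega_2 t$
            if $x_{i+2}$ accurate enough then quit
            $u = u - \omega_1 v - \omega_2 w$
    end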

For more general BiCGSTAB($\ell$) schemes see [78,
77].
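The Bi-CGSTAB(2) algorithm given above can be transcribed into NumPy as follows. This is a sketch under stated assumptions: no preconditioning, dense matrices, $\tilde r_0 = r_0$, and a relative stopping test on the updated residual after each full cycle. The test matrix (a shifted skew-symmetric matrix, which has eigenvalues with sizeable imaginary parts) is hypothetical data chosen for the example.

```python
import numpy as np

def bicgstab2(A, b, x0=None, tol=1e-10, maxit=500):
    """Unpreconditioned Bi-CGSTAB(2) sketch; returns (x, steps)."""
    n = b.size
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x
    rt = r.copy()                          # shadow vector r~_0 = r_0
    rho0, alpha, omega2 = 1.0, 0.0, 1.0
    u = np.zeros(n)
    bnorm = np.linalg.norm(b)
    for k in range(0, maxit, 2):
        rho0 = -omega2 * rho0
        # even BiCG step
        rho1 = rt @ r
        beta = alpha * rho1 / rho0
        rho0 = rho1
        u = r - beta * u
        v = A @ u
        alpha = rho0 / (v @ rt)
        r = r - alpha * v
        s = A @ r
        x = x + alpha * u
        # odd BiCG step
        rho1 = rt @ s
        beta = alpha * rho1 / rho0
        rho0 = rho1
        v = s - beta * v
        w = A @ v
        alpha = rho0 / (w @ rt)
        u = r - beta * u
        r = r - alpha * v
        s = s - alpha * w
        t = A @ s
        # GCR(2) part: minimize ||r - omega1*s - omega2*t||_2 (2x2 normal equations)
        omega1, mu = r @ s, s @ s
        nu, tau = s @ t, t @ t
        omega2 = r @ t
        tau = tau - nu * nu / mu
        omega2 = (omega2 - nu * omega1 / mu) / tau
        omega1 = (omega1 - nu * omega2) / mu
        x = x + omega1 * r + omega2 * s + alpha * u
        r = r - omega1 * s - omega2 * t
        if np.linalg.norm(r) <= tol * bnorm:
            return x, k + 2
        u = u - omega1 * v - omega2 * w
    return x, maxit

# Hypothetical test problem with complex eigenvalues: 3*I plus a skew part.
rng = np.random.default_rng(3)
n = 40
M = rng.standard_normal((n, n)) / np.sqrt(n)
A = 3.0 * np.eye(n) + (M - M.T)
b = rng.standard_normal(n)
x, steps = bicgstab2(A, b)
```

One full cycle uses four matrix vector products (`A @ u`, `A @ r`, `A @ v`, `A @ s`) and advances the iteration count by two.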

Another advantage of BiCGSTAB(2) over BiCGSTAB2
is in its efficiency. The BiCGSTAB(2) algorithm
requires 14 vector updates, 9 innerproducts
and 4 matrix vector products per full cycle. This
has to be compared with a combined odd-numbered
and even-numbered step in BiCGSTAB2, which requires
22 vector updates, 11 innerproducts, and 4
matrix vector products, and with two steps of BiCGSTAB
which require 4 matrix vector products, 8
innerproducts and 12 vector updates. The numbers
for BiCGSTAB2 are based on an implementation described
in [64].

Also with respect to memory requirements, BiCGSTAB(2)
takes an intermediate position: it requires
2 n-vectors more than BiCGSTAB and 2 n-vectors
less than BiCGSTAB2.

For distributed memory machines the innerproducts
may cause communication overhead problems
(see, e.g., [16]). We note that the BiCG steps are
very similar to conjugate gradient iteration steps, so
that we may consider all kinds of tricks that have been
suggested to reduce the number of synchronization
points caused by the 4 innerproducts in the BiCG
parts. For an overview of these approaches see [6].

If on a specific computer it is possible to overlap
computation with communication, then the BiCG
parts can be rescheduled as to create overlap possibilities:

1. The computation of $\rho_1$ in the even BiCG
step may be done just before the update of $u$ at the
end of the GCR part.

2. The update of $x_{i+2}$ may be delayed until after the
computation of $\gamma$ in the even BiCG step.

3. The computation of $\rho_1$ for the odd BiCG step can
be done just before the update for $x$ at the end of the
even BiCG step.

4. The computation of $\gamma$ in the odd BiCG step has
already overlap possibilities with the update for $u$.

For  the  GCR(2)  part  we  note  that  the  5  innerprod¬ 
ucts  can  be  taken  together,  in  order  to  reduce  start¬ 
up  times  for  their  global  assembling.  This  gives  the 
method  BiCGSTAB(2)  a  (slight)  advantage  over  Bi¬ 
CGSTAB.  Furthermore  we  note  that  the  updates  in 
the  GCR(2)  may  lead  to  more  efficient  code  than  for 
BiCGSTAB,  since  some  of  them  can  be  combined. 

Our  next  numerical  example  illustrates  quite  nicely 
the  difference  in  convergence  behavior  of  some  of  the 
methods  that  we  have  discussed. 

Example. We consider an advection dominated 2nd order PDE, with Dirichlet boundary conditions, on the unit cube (this equation was taken from [50]):

(6.5i)  $u_{xx} + u_{yy} + u_{zz} + 1000\,u_x = f.$

The function f is defined by the solution

$u(x, y, z) = xyz(1 - x)(1 - y)(1 - z).$

This equation was discretized using 22 x 22 x 22 volumes, resulting in a seven-diagonal linear system of order 10648. In order to make differences between iterative methods more visible, we have, here and in our other examples, not used any form of preconditioning.
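
To make the structure of such a system concrete, here is a minimal sketch that assembles the rows of a seven-point discretization on a 22 x 22 x 22 grid. Several things are assumptions made for illustration only and are not taken from the text: the advection term is taken in the x-direction with coefficient 1000, central differences on a uniform mesh of width h = 1/(n+1) stand in for the finite volume discretization, and the function name `assemble_rows` is hypothetical.

```python
def assemble_rows(n=22, adv=1000.0):
    """Sparse rows of an (assumed) central-difference discretization of
    u_xx + u_yy + u_zz + adv * u_x on an n x n x n grid with Dirichlet boundary.

    Each row is a dict {column index: coefficient}. Neighbours on the boundary
    are dropped, since their known values move to the right-hand side.
    """
    h = 1.0 / (n + 1)  # assumed uniform mesh width
    diff = 1.0 / h**2

    def idx(i, j, k):
        return (k * n + j) * n + i

    rows = []
    for k in range(n):
        for j in range(n):
            for i in range(n):
                row = {idx(i, j, k): -6.0 * diff}
                # x-direction: diffusion plus the advection term adv * u_x
                if i > 0:
                    row[idx(i - 1, j, k)] = diff - adv / (2.0 * h)
                if i < n - 1:
                    row[idx(i + 1, j, k)] = diff + adv / (2.0 * h)
                # y- and z-directions: diffusion only
                if j > 0:
                    row[idx(i, j - 1, k)] = diff
                if j < n - 1:
                    row[idx(i, j + 1, k)] = diff
                if k > 0:
                    row[idx(i, j, k - 1)] = diff
                if k < n - 1:
                    row[idx(i, j, k + 1)] = diff
                rows.append(row)
    return rows


rows = assemble_rows()
order = len(rows)                            # number of unknowns
max_per_row = max(len(r) for r in rows)      # number of diagonals
```

With this mesh width the two x-neighbour coefficients have opposite signs, which is one way to see why the matrix is strongly nonsymmetric.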

In Figure 4 we see a plot of the convergence history. Bi-CGSTAB almost stagnates, as might be anticipated from the fact that this linear system has eigenvalues with relatively large imaginary parts. Surprisingly, Bi-CGSTAB does even worse than Bi-CG. For this type of matrices this behavior of Bi-CGSTAB is not uncommon and, as we will see in the next subsection, this can be explained by the poor recovery of the Bi-CG iteration coefficients $\alpha_k$ and $\beta_k$. BiCGstab(2) converges quite nicely and almost twice as fast as Bi-CG. GMRES(25) is about as fast as Bi-CG. Since the GMRES steps are much more expensive, BiCGstab(2) is the most efficient method here.

6.6  Maintaining  Convergence: 

The BiCGstab methods are designed for smooth convergence, with the purpose of avoiding loss of local bi-orthogonality in the underlying Bi-CG process. This is important, since then the convergence of the Bi-CG part is exploited as much as possible. However, local bi-orthogonality may also be disturbed by, for instance, inaccuracies in the Bi-CG coefficients $\alpha$ and $\beta$. They are the quotients of scalars $\rho = (r_i, \tilde r_0)$ and $\gamma = (Ap, \tilde r_0)$ (see the algorithms for BiCGSTAB and BiCGSTAB(2)) and they will be inaccurate if $\rho$ or $\gamma$ is relatively small (see (6.6b)). The question is, when does this occur and how can it be avoided? Here, we will concentrate on $\rho$ only, but similar arguments apply to $\gamma$ as well.

As in the introduction of this section, $r_i$ is the residual $r_i = \tilde P_i(A)P_i(A)r_0$ where $\tilde P_i$ is an appropriate polynomial of degree i with $\tilde P_i(0) = 1$. Now, $\rho$ is given by

(6.6a)  $\rho = \rho_i = (\tilde P_i(A)P_i(A)r_0, \tilde r_0).$

The scalar $\rho_i$ can be small if the underlying Bi-Lanczos process nearly breaks down (i.e. $(\tilde P_i(A)P_i(A)r_0, \tilde r_0) \approx 0$ relatively, for any polynomial $\tilde P_i$ of degree i). Also an 'unlucky' choice of $\tilde P_i$ may lead to a small $\rho_i$ (which occurs in Bi-CGSTAB if the GCR(1) part stagnates). Here, we will concentrate on typical Bi-CGSTAB situations. Therefore, we assume that the Bi-Lanczos process itself (and the LU decomposition) does not (nearly) break down.
The relative rounding error $\epsilon_i$ in $\rho_i$ can be sharply bounded by

(6.6b)  $|\epsilon_i| \le n\,\xi\,\frac{(|r_i|, |\tilde r_0|)}{|(r_i, \tilde r_0)|} \le n\,\xi\,\frac{\|r_i\|\,\|\tilde r_0\|}{|(r_i, \tilde r_0)|}.$

For a small relative error we want to have the expression at the right-hand side as small as possible.

Since the Bi-CG residual $P_i(A)r_0$, here to be denoted by $s_i$, is orthogonal to $\mathcal{K}_i(A^T; \tilde r_0)$, it follows that

$(r_i, \tilde r_0) = \theta_i\,(A^i s_i, \tilde r_0)$  if  $\tilde P_i(A) = \theta_i A^i + \cdots$.

Therefore, since $\|\tilde r_0\| / |(A^i s_i, \tilde r_0)|$ does not depend on $\tilde P_i$, minimizing the right-hand side of (6.6b) is equivalent to minimizing

(6.6c)  $\frac{\|\tilde P_i(A)s_i\|}{|\theta_i|}$

with respect to all polynomials $\tilde P_i$ of exact degree i with $\tilde P_i(0) = 1$. This minimization problem is solved by the FOM polynomial $P_i^F$, here associated with the initial residual $s_i$: $P_i^F$ is the ith degree polynomial for which $r_i^F = P_i^F(A)s_i$ (cf. Section 6.2). This polynomial is characterized by:

$P_i^F(A)s_i \perp \mathcal{K}_i(A; s_i)$  and  $P_i^F(0) = 1.$

For optimally accurate coefficients, we should select FOM polynomials for our polynomials $\tilde P_i$. However, since the hybrid Bi-CG methods are designed to avoid all the work for the construction of an orthogonal basis, the selection of complete FOM polynomials is out of the question.

For efficiency reasons, we have used products of first degree polynomials in Bi-CGSTAB and products of degree $\ell$ polynomials in BiCGstab($\ell$). Of course, our arguments can also be applied to such low degree factors. Therefore, suppose that $s = Q_{i-\ell}(A)P_i(A)r_0$ (as in BiCGstab($\ell$)) has been computed and that the vectors $s, As, \ldots, A^\ell s$ are available. The suggestion for BiCGstab($\ell$) to minimize the residual over this particular Krylov subspace is equivalent to selecting a polynomial factor $q_i$ ($Q_i = q_i Q_{i-\ell}$) of exact degree $\ell$ with $q_i(0) = 1$ such that

(6.6d)  $\|q_i(A)s\|$

is minimal, while in this situation, for optimally accurate coefficients, we would rather like to minimize

(6.6e)  $\frac{\|q_i(A)s\|}{|\theta_i|}$

where $\theta_i$ is such that $q_i(A) = \theta_i A^\ell + \cdots$.

The GMRES polynomial $q_i^G$ of degree $\ell$ solves (6.6d), the FOM polynomial $q_i^F$ solves (6.6e). For small residuals, the FOM polynomial is not optimal:

(6.6f)  $\|q_i^G(A)s\| = |c_i|\,\|q_i^F(A)s\|$

with $c_i$ as in (6.2d). Similarly, for accurate coefficients, the GMRES polynomial is not optimal [75]:

$\frac{\|q_i^F(A)s\|}{|\theta_i^F|} = |c_i|\,\frac{\|q_i^G(A)s\|}{|\theta_i^G|}$

with the same scalar $c_i$.

For degree 1 factors, as in Bi-CGSTAB (assuming no preconditioning),

(6.6g)  $c_i = \frac{(s, As)}{\|s\|\,\|As\|},$

and $c_i$ is the cosine of the angle between s and As (in the BiCGSTAB algorithm, t represents As).
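
For a degree 1 factor $q(A) = I - \omega A$ the two choices can be compared by hand. The sketch below is an illustration under stated assumptions, not code from the text: it takes the minimal residual choice $\omega^G = (s,t)/(t,t)$ for GMRES(1) and the Galerkin choice $\omega^F = (s,s)/(s,t)$ for FOM(1), with t = As, and the helper names are hypothetical. The leading coefficient of q is $-\omega$, so $|\theta| = |\omega|$.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def degree_one_factors(s, t):
    """Compare GMRES(1) and FOM(1) for q(A) = I - omega*A, where t = A s."""
    c = dot(s, t) / (norm(s) * norm(t))   # cosine of the angle between s and As
    omega_g = dot(s, t) / dot(t, t)       # minimizes ||s - omega t||
    omega_f = dot(s, s) / dot(s, t)       # makes s - omega t orthogonal to s
    res_g = norm([a - omega_g * b for a, b in zip(s, t)])
    res_f = norm([a - omega_f * b for a, b in zip(s, t)])
    return c, omega_g, omega_f, res_g, res_f


# Small hand-checkable case: s = (1, 0), t = As = (1, 2), so c = 1/sqrt(5).
c, omega_g, omega_f, res_g, res_f = degree_one_factors([1.0, 0.0], [1.0, 2.0])
```

In this case the GMRES(1) residual has norm $\sqrt{0.8}$ while the FOM(1) residual has norm 2, a ratio of exactly |c|; conversely, the quantity residual norm divided by $|\omega|$ is smaller for FOM(1) by the same factor |c|.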

Clearly, for extremely small $|c_i|$, say $|c_i| < \sqrt{\xi}$ (in the $\ell = 1$ case, this means that s and As are almost orthogonal), taking GMRES polynomials for the degree $\ell$ factors will lead to inaccurate coefficients $\rho_i$, $\alpha$ and $\beta$, while FOM polynomials on the other hand will lead to large residuals. In both situations, the speed of convergence will seriously deteriorate. The same phenomena can be observed when in a consecutive number of sweeps $|c_i|$ is small, but not necessarily extremely small (say, it takes k sweeps before $|c_{i-k+1} \cdots c_{i-1} c_i| < \sqrt{\xi}$). In other words, the inaccuracies seem to accumulate. This seems to occur quite often in practice. E.g., for linear equations stemming from PDEs with large advection terms, Bi-CGSTAB often stagnates, although all $c_i$ may be larger than, say, .1, and none of the $\omega_i$ can be considered to be relatively small ($\omega_i = c_i\|s\|/\|As\|$).


Both Bi-CGSTAB and BiCGstab($\ell$) are built on top of the same Bi-CG process. At roughly the same computational costs, one sweep of BiCGstab($\ell$) covers the same Bi-CG track as $\ell$ sweeps of Bi-CGSTAB. In one sweep of BiCGstab($\ell$), GMRES($\ell$) is applied once; in $\ell$ sweeps of Bi-CGSTAB, GMRES(1) is applied $\ell$ times. For two reasons it pays off to use GMRES($\ell$) instead of $\ell \times$ GMRES(1):

1. Due to the super-linear convergence, one sweep of GMRES($\ell$) may be expected to give a better residual reduction than $\ell$ times GMRES(1).

2. In $\ell$ steps of GMRES(1), $\ell$ small $c_i$'s may contribute to inaccuracies in the coefficients $\alpha$ and $\beta$, where GMRES($\ell$) contributes to this only once.

BiCGstab($\ell$) profits from GMRES($\ell$) by a better residual reduction in the GMRES part and by the faster convergence of a better recovered Bi-CG due to the more stable computations. However, we do not recommend taking $\ell$ large; $\ell = 2$ or $\ell = 4$ will usually already lead to almost optimal speed of convergence. The computational costs increase slightly by increasing $\ell$ (i.e. $2\ell + 10$ vector updates and $\ell + 7$ inner products per 4 matrix multiplications), and more vectors have to be stored ($2\ell + 5$ vectors). Moreover, the method is less accurate for larger $\ell$ due to the fact that intermediate residuals (as r and $r - \omega_1 s$ in the Bi-CGSTAB(2) algorithm) can be large, with similar negative effects as in Section 6.5.4.

Fig. 6: Convergence stabilized Bi-CGSTAB. Fig. 8: Convergence stab. BiCGstab(2). (Horizontal axes: number of matrix-vector multiplications.)

For Bi-CGSTAB there is a simple strategy that relaxes the danger of error amplification in consecutive sweeps with small $|c_i|$: replace in the Bi-CGSTAB algorithm the line

'$\omega = (s,t)/(t,t)$'

by the piece of code in Algorithm 1. In this way we limit the size of |c| from below. The constant .7 is rather arbitrary and may be replaced by any other fixed non-small constant less than 1. Since GMRES(1) reduces well only if $|c_i| \approx 1$ (see (6.2c)), this strategy still profits from a possible good reduction by GMRES(1). A similar strategy that is equally cheap and easy to implement can be applied to BiCGstab($\ell$); see [75] for details.
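
A minimal sketch of such a replacement for the line that computes $\omega$ is given below. It is an assumption about the form of the limiting step, chosen to be consistent with the relation $\omega = c\,\|s\|/\|t\|$ and with the quantity max(|c|, 0.7) that is plotted in the figures; the function name `limited_omega` and the parameter name `kappa` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def limited_omega(s, t, kappa=0.7):
    """Return (omega, c) where omega uses max(|c|, kappa) instead of |c|.

    c is the cosine of the angle between s and t = A s. If |c| >= kappa the
    result coincides with the usual omega = (s,t)/(t,t).
    """
    norm_s = math.sqrt(dot(s, s))
    norm_t = math.sqrt(dot(t, t))
    c = dot(s, t) / (norm_s * norm_t)
    c_limited = math.copysign(max(abs(c), kappa), c)  # keep the sign of c
    omega = c_limited * norm_s / norm_t
    return omega, c


# s and t nearly aligned: the limit is inactive and omega = (s,t)/(t,t).
omega_a, c_a = limited_omega([1.0, 0.0], [2.0, 0.1])
# s and t at a large angle (c = 1/sqrt(5) < 0.7): omega is enlarged.
omega_b, c_b = limited_omega([1.0, 0.0], [1.0, 2.0])
```

When the limit is active, the residual $s - \omega t$ is no longer the minimal one, which is the price paid for a larger leading coefficient and hence more accurate Bi-CG coefficients.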

We give a few numerical examples that demonstrate the cumulative effects of small |c|'s and that illustrate the effects of limiting their size.

Examples. The figures for the examples display, all on log-scale, the values for each iteration step of

- the residual norms $\|r\|$, by solid curves;

- the scaled $\rho$, $\hat\rho = |(r, \tilde r_0)|/(\|r\|\,\|\tilde r_0\|)$ (cf. (6.6b)), by dashed-dotted curves;

- the scaled $\gamma$, $\hat\gamma = |(Ap, \tilde r_0)|/(\|Ap\|\,\|\tilde r_0\|)$, by dotted curves;

- |c|, resp. max(|c|, 0.7), by bullets.

Before describing the examples, we will discuss part of the results.

In Figures 5-16, we see that the scaled $\rho$ and the scaled $\gamma$ behave similarly (the dashed-dotted and dotted curves coincide more or less). Further, none of the |c| is extremely small, not even in cases where $\rho$ and $\gamma$ are. The decrease of $\rho$ for values of $\rho$ not in the range of the machine precision (> 10“'^^) seems to be proportional to the product of previous |c|'s. In all the examples, the method stagnates if the $\rho$'s or $\gamma$'s become extremely small, say less than 10“^^. In these cases, almost all significance of the Bi-CG coefficients $\alpha$ and $\beta$ will be lost. Limiting the size of |c| (Algorithm 1) slows down the decrease of $\rho$ and $\gamma$. In the captions of the figures, we used the adjective 'stabilized' to indicate that we used the limiting strategy. Often 'stabilizing' is enough to overcome the stagnation phase, and to lead to a converging process.

Alg. 1: Limiting the size of |c|.

Fig. 9: Convergence Bi-CGSTAB.

Example 1 (Figures 5-8). BiCGstab(2) converges. Although stabilizing Bi-CGSTAB leads to more accurate Bi-CG coefficients in the initial phase of the process, this is apparently not enough to restore full convergence.

Example 2 (Figures 9-12). Increasing $\ell$ to $\ell = 2$ leads to a slowly converging BiCGstab(2) process (many more than 300 matrix-vector multiplications are needed; not shown in the graph). Our simple stabilizing strategy works well here.

Example 3 (Figures 13-16). The combined improvements, stabilizing and increasing $\ell$ to $\ell = 2$, are necessary for convergence.

For the first example, we have taken the PDE of (6.5i). The right-hand side f is defined by the solution

$u(x, y, z) = \exp(xyz)\,\sin(\pi x)\,\sin(\pi y)\,\sin(\pi z).$

The discretization is with 10 x 10 x 10 finite volumes (no preconditioner has been used).

Fig. 10: Convergence stabilized Bi-CGSTAB.

Fig. 11: Convergence BiCGstab(2).

In the second and third example [33, 70], we have discretized

$-u_{xx} - u_{yy} + a\,(x u_x + y u_y) + b\,u = f$

on the unit square with Dirichlet boundary conditions, with 63 x 63 finite volumes, taking a = 100 and b = -200, respectively 66 x 66, a = 1000 and b = 10 (no preconditioner has been used). The function f is such that the discrete solution is constant 1 (on the grid).

6.7  Generalized  CGS: 

We have now discussed in some detail the family of BiCGstab($\ell$) methods, but one should not deduce from this that these methods are to be preferred over CGS in all circumstances. We have had very good experiences with CGS in the context of solving non-linear problems with Newton's method. It turns out that we can also exploit some of the presented ideas to improve on CGS itself.

Fig. 12: Convergence stab. BiCGstab(2).

Fig. 13: Convergence Bi-CGSTAB.

In the Newton method one has to solve a Jacobian system for the correction. This can be done by any method of choice, e.g., CGS or BiCGstab($\ell$). Often fewer Newton steps are required to solve a non-linear problem accurately when using CGS. Although the BiCGstab methods tend to solve each of the linear systems (defined by the Jacobian matrices) faster, the computational gain in these inner loops does not always compensate for the loss in the outer loop caused by the larger number of Newton steps.

This phenomenon can be understood as follows. For eigenvalues $\lambda$ that are extremal in the convex hull of the set of all eigenvalues of A (the Jacobian matrix), the values $P_i(\lambda)$ of the Bi-CG polynomials $P_i$ tend to converge more rapidly towards zero than for eigenvalues $\lambda$ in the interior. Since CGS squares the Bi-CG polynomials, CGS may be expected to reduce extremely well the components of the initial residual $r_0$ in the direction of the eigenvectors associated with extremal eigenvalues $\lambda$: with reduction factor $P_i(\lambda)^2$. Of course, the value $P_i(\lambda)$ can also be large, specifically for interior eigenvalues and in an initial stage of the process. CGS amplifies the associated components (which also explains the typical irregular convergence behavior of CGS). The BiCGstab polynomial $Q_i$ does not have this tendency of favoring the extremal eigenvalues. Therefore, the BiCGstab methods tend to reduce all eigenvector components equally well: on average, the "interior components" of a BiCGstab residual $r_i$ are smaller than the corresponding components of a CGS residual $r_i$, while with respect to the exterior components the situation is the other way around. However, the non-linearity of a non-linear problem seems often to be represented rather well by the space spanned by the "extremal eigenvectors". With respect to this space, and hence with respect to the complete space, Newton's scheme with CGS behaves like an exact Newton scheme.

Fig. 14: Convergence stabilized Bi-CGSTAB.

Fig. 15: Convergence BiCGstab(2).

Fig. 16: Convergence stab. BiCGstab(2).


We would like to preserve this property when constructing iterative schemes for Newton iterations. Fokkema et al. [29] suggest polynomials $\tilde P_i$ that lead to efficient algorithms (small modifications of the CGS algorithm) with a convergence that is slightly smoother, faster, and more accurate than for CGS, but that still has the property of reducing extremal components quadratically. As a linear solver for isolated linear problems these "generalized CGS" schemes do not seem to have much advantage over BiCGstab($\ell$), but as a linear solver in a Newton scheme for non-linear problems, they often do rather well.

7  RELIABLE  UPDATING 

In all the Bi-CG related methods we see that the approximation for x and the residual vector r are updated by different vectors, and that the value for x does not influence the further iteration process, whereas the value for r does. In exact arithmetic the updated r is equal to the true residual b - Ax, but in rounded arithmetic it is unavoidable that differences between r and b - Ax arise. This means that we may be misled by our stopping criteria, which are usually based upon knowledge of the updated r (and that we may have iterated too far in vain).

In  this  section  we  will  discuss  some  techniques  that 
have  been  proposed  recently  for  the  improvement  of 
the  updating  steps.  It  turns  out  that  this  can  be 
settled  by  relatively  easy  means. 

Although the techniques in the previous section led to smoother and faster convergence and more accurate approximations, the approximation may still not be as accurate as possible. Here, we strive for optimal accuracy, i.e. the updated $r_i$ should be very close to the values of $b - Ax_i$, while leaving the convergence of the updated r intact.

First, we observe that even if $x_m$ is the exact solution then the residual, computed in rounded arithmetic as $b - Ax_m$, may not be expected to be zero: using the notation of Section 6.5.4,

(7.1a)  $\|b - Ax_m\| \le \xi\,(\|b\| + n_A\,\||A|\|\,\|x_m\|) \le 2\,\Gamma\,\xi\,\|b\|.$

Therefore, the best we can strive for is an approximation $x_m$ for which the true residual and the updated one differ in order of magnitude by the initial residual times the relative machine precision ($\xi\|r_0\|$; recall that we assumed $x_0 = 0$, and hence $r_0 = b$).

Now it also becomes obvious why it is a bad idea to replace the updated residual in each step by the true one. Apart from the fact that this would cost an additional matrix-vector multiplication in each step, it also introduces errors in the recursions for the residuals. Although these errors may be expected to be small relative to $r_0$, they will be large relative to $r_i$ if $\|r_i\| \ll \|r_0\|$. This perturbs the local bi-orthogonality of the underlying Bi-CG process and it may significantly slow down the speed of convergence. This observation suggests replacing the updated residual by the true one only if the updated residual has the same order of magnitude as the initial residual. However, meanwhile $x_i$ and $r_i$ may have drifted apart, and replacing $r_i$ by $b - Ax_i$ brings the "error of $x_i$" into the recursion (bounded as in (6.5g)), and again the speed of convergence may be affected. Although it is a good idea to use true residuals at strategic places, the approximation $x_i$ should first be 'tied' more closely to the updated residual $r_i$. We can achieve this by updating $x_i$ cumulatively: if $x_i = x_0 + w_1 + \ldots + w_i$ (cf. (6.5d)) then we actually compute $x_i$ in groups as

(7.1b)  $x_i = x_0 + x'_1 + x'_2 + \ldots$

where, for some increasing sequence of indices $\pi(1) = 1, \pi(2), \ldots$, $x'_j$ represents the sum of a group:

$x'_j = w_{\pi(j)} + w_{\pi(j)+1} + \ldots + w_{\pi(j+1)-1}$, etc.

Simultaneously, we compute $r_i$ as

(7.1c)  $r_i = r_0 - Ax'_1 - Ax'_2 - \ldots$.

In this way we can control the size of the updates for $x_i$ and $r_i$, and we avoid large errors (cf. (6.5g)): for a proper choice of the $\pi(j)$, the $x'_j$ will be small even if some of the $w_j$ are large.

In the modification of the algorithms that we will propose in Algorithm 2, we kept in mind that we may only allow errors which

(a) are small with respect to the initial residual $r_0$ (otherwise accuracy will be disturbed) and

(b) are small with respect to the present updated residual $r_i$ (otherwise local bi-orthogonality may be jeopardized).
In Section 3, we have explained that it is no restriction to take $x_0 = 0$, arguing that this situation can be forced simply by a shift: shift x by $x_0$, and b by $Ax_0$. This shift can be made explicit in the hybrid Bi-CG algorithms by making three changes:

(i) adding as a last line to the initialization phase

    x = x_0;  x' = 0;  b' = r_0;

(ii) adding as a last line in the algorithms (just after 'end')

    x = x + x';

(iii) replacing all $x_i$ (and x) by x' (skipping the index i).

Even in rounded arithmetic, this modification will not change the value of any of the vectors and scalars in the computational scheme, except for the x's. Since x + x' is the approximation that we are interested in, one also may want to change the termination criterion. We propose to replace the line

    if x is accurate enough then quit;

by

    if ||r_{i+1}|| is small enough then quit;

To allow for a more accurate way of updating the residual and the approximation, we suggest adding another few lines just before 'end' in the algorithm, as is shown in Algorithm 2. We suggest replacing the updated residual by the true one on strategically chosen steps (we have to explain when the value of the boolean function 'compute_res' is true). However, we also suggest shifting the problem once in a while (when the boolean function 'update_app' is true) in order to let the right-hand side decrease (cf. (7.1a)). Here we use the fact that, in exact arithmetic, these intermediate shifts also do not change the iteration parameters and vectors (except for the vectors x). Observe that the updated residual $r_{i+1}$ is replaced by the true residual b' - Ax' of the shifted problem if 'compute_res' is true.

    x = x_0;  x' = 0;  b' = r_0;
    for i = 0, 1, 2, ...
        •  Replace all x_i and x by x'.
        if ||r_{i+1}|| is small enough then quit;
        set 'compute_res' and 'update_app';
        if 'compute_res' is true
            r_{i+1} = b' - A x';
            if 'update_app' is true
                x = x + x';  x' = 0;  b' = r_{i+1};
            endif
        endif
    endfor
    x = x + x';

Alg. 2: For accurate approximations.

For this we propose the following strategy.

Update x and b' only if the residual is significantly smaller than the initial residual, while an intermediate residual was larger (cf. (7.1b), (7.1c) and reminder (a)):

(7.1d)  'update_app' = true  if  $\|r_{i+1}\| < \|b\|/100$  &  $\|b\| \le \mu$,
        else 'update_app' = false,

where $\mu = \max\|r_j\|$ and the maximum is taken over all residuals since the previous update of x and b' (since the previous 'update_app' is true).

The bound in (7.1a) suggests that the norm $\|b\|$ of the initial residual should be used as criterion for shifting the problem ('update_app' is true if $\|r_{i+1}\| < \|b\|$ & $\|r_i\| \ge \|b\|$). However, if the process converges irregularly this would lead to many shifts. The relaxed version in (7.1d) turns out to work equally well at less cost.

Compute a true residual whenever 'update_app' is true, or if a previous residual is larger than the initial residual and significantly larger than the present updated residual:

(7.1e)  'compute_res' = true  if  $\|r_{i+1}\| < M/100$  &  $\|b\| \le M$,  or 'update_app' is true,
        else 'compute_res' = false,

where $M = \max\|r_j\|$ and the maximum is taken over all residuals since the last computation of the true residual.
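
The bookkeeping for the two boolean functions can be sketched as follows. This is a minimal sketch under stated assumptions: it tracks the two running maxima ($\mu$ since the last shift, M since the last true residual), starts both maxima at zero so that only residuals produced by the iteration count, and resets them to the current residual norm after the corresponding action. The class name `ReliableUpdateFlags` and its interface are hypothetical.

```python
class ReliableUpdateFlags:
    """Decide, per iteration step, whether to compute a true residual
    ('compute_res') and whether to shift the problem ('update_app')."""

    def __init__(self, b_norm, factor=100.0):
        self.b_norm = b_norm   # norm of the initial residual (x_0 = 0, so r_0 = b)
        self.factor = factor   # the constant 100 of the relaxed criteria
        self.mu = 0.0          # max residual norm since the previous shift
        self.big_m = 0.0       # max residual norm since the last true residual

    def step(self, res_norm):
        """res_norm is the norm of the updated residual r_{i+1}."""
        self.mu = max(self.mu, res_norm)
        self.big_m = max(self.big_m, res_norm)
        update_app = (res_norm < self.b_norm / self.factor
                      and self.b_norm <= self.mu)
        compute_res = update_app or (res_norm < self.big_m / self.factor
                                     and self.b_norm <= self.big_m)
        if compute_res:
            self.big_m = res_norm   # restart the maximum for 'compute_res'
        if update_app:
            self.mu = res_norm      # restart the maximum for 'update_app'
        return compute_res, update_app


# Irregular convergence: a peak above ||b|| = 1 followed by a sharp decrease.
irregular = ReliableUpdateFlags(b_norm=1.0)
flags_irregular = [irregular.step(rn) for rn in (5.0, 0.5, 0.005)]

# Smooth convergence: no residual ever exceeds ||b||, so nothing is triggered.
smooth = ReliableUpdateFlags(b_norm=1.0)
flags_smooth = [smooth.step(rn) for rn in (0.5, 0.1, 0.001)]

# A large peak followed by a residual of the size of ||b||: a true residual is
# computed, but the problem is not yet shifted.
peak = ReliableUpdateFlags(b_norm=1.0)
flags_peak = [peak.step(rn) for rn in (200.0, 1.0)]
```

The smooth case illustrates why the strategy costs nothing extra for processes that never produce residuals larger than the initial one.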

Replacing the updated residual by the true one perturbs the recursion for the residuals. If the residual decreases too much since the previous replacement, the perturbation may become large relative to the present residual (reminder (b)). Therefore, 'compute_res' may be true more often than 'update_app'.

We suggest adding the above strategy to an existing code. That means that an additional matrix-vector multiplication has to be performed whenever a true residual has to be computed. The conditions (7.1d) and (7.1e) are chosen so as to minimize the number of these additional computations. One also may try to skip a matrix-vector multiplication in one of the preceding lines of the algorithm, which requires some additional care for BiCGstab($\ell$), but which can easily be accomplished for CGS.

If CGS is modified as suggested, then the new lines do not require additional matrix-vector multiplications, and there is no need to restrict the number of computations of true local residuals. For this CGS variant, Neumaier [56] suggested places where the x and b' can be updated for accurate approximations: update x and b' whenever the residual decreases with respect to the previous 'best residual',

(7.1f)  'update_app' = true  if  $\|r_{i+1}\| < \|b'\|$,
        else 'update_app' = false.

The modifications according to Neumaier's approach are given in Algorithm 3. Observe that the norms of the b' (the residuals with respect to the x) strictly decrease: the Neumaier trick also smooths convergence (without improving its speed!).

    x = x_0;  x' = 0;  b' = r_0;  μ' = ||b'||;
    for i = 0, 1, 2, ...
        •  Replace all x_i and x by x'.
           Skip the CGS update for r
           together with the MV involved
           in this update. Compute instead
           r_{i+1} = b' - A x';  μ = ||r_{i+1}||;
        if μ is small enough then quit;
        if μ < μ'
            x = x + x';  x' = 0;
            b' = r_{i+1};  μ' = μ;
        endif
    endfor
    x = x + x';

Alg. 3: Neumaier's strategy for CGS.
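
The following is a small, self-contained sketch of CGS with Neumaier's updating strategy along the lines of Algorithm 3. It is an illustration under assumptions, not the implementation used for the experiments: it is unpreconditioned, uses dense lists, takes the shadow residual equal to $r_0$, and uses a relative tolerance; the function names are hypothetical. The test matrix is a small symmetric positive definite one, chosen only because the behavior of CGS on it is easy to predict.

```python
import math


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def cgs_neumaier(a, b, x0, tol=1e-10, max_iter=50):
    n = len(b)
    x = list(x0)                     # accumulated approximation
    xp = [0.0] * n                   # x': approximation for the shifted problem
    r = axpy(-1.0, matvec(a, x), b)  # r_0 = b - A x_0
    bp = list(r)                     # b' = r_0
    mu_best = math.sqrt(dot(bp, bp)) # mu' = ||b'||
    stop = tol * mu_best
    shadow = list(r)                 # shadow residual, here simply r_0
    rho_old = 1.0
    p = [0.0] * n
    q = [0.0] * n
    for _ in range(max_iter):
        rho = dot(shadow, r)
        beta = rho / rho_old
        u = axpy(beta, q, r)                    # u = r + beta q
        p = axpy(beta, axpy(beta, p, q), u)     # p = u + beta (q + beta p)
        v = matvec(a, p)
        alpha = rho / dot(shadow, v)
        q = axpy(-alpha, v, u)                  # q = u - alpha v
        xp = axpy(alpha, axpy(1.0, u, q), xp)   # x' = x' + alpha (u + q)
        # The CGS recursion for r (and its MV) is skipped; instead the true
        # residual of the shifted problem is computed.
        r = axpy(-1.0, matvec(a, xp), bp)       # r = b' - A x'
        rho_old = rho
        mu = math.sqrt(dot(r, r))
        if mu <= stop:
            break
        if mu < mu_best:                        # new 'best residual': shift
            x = axpy(1.0, xp, x)
            xp = [0.0] * n
            bp = list(r)
            mu_best = mu
    return axpy(1.0, xp, x)


A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = cgs_neumaier(A, b, [0.0] * 4)
true_res = axpy(-1.0, matvec(A, x), b)
true_res_norm = math.sqrt(dot(true_res, true_res))
```

Each iteration still needs two matrix-vector multiplications: the one for $Ax'$ takes the place of the one in the standard CGS residual update.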

Below, we discuss the effects of our strategies in practice. We illustrate our observations by a simple numerical example.

Example. Figure 17 shows the convergence history of the true residuals as produced by standard CGS, and by the modified versions of CGS as suggested above, applied to the SHERMAN4 matrix of the Harwell-Boeing collection (as in the example of Section ??). The dotted curve represents the results for standard CGS. We also applied modified CGS as in Algorithm 2, using the update criteria (7.1d) and (7.1e). The solid curve represents the results for this simple strategy, while the dashed-dotted curve represents the results for Neumaier's strategy in Algorithm 3. On log-scale, the norm of the true residuals $\|b - Ax_i\|$, $\|b - A(x + x')\|$, respectively, is plotted against the number of matrix-vector multiplications. Neumaier's strategy as well as ours leads to approximations that are accurate (cf. (7.1a)): comparing $\|r_0\|$
with the norm of the smallest true residual, we see
that  a  reduction  is  obtained  by  a  factor  fa  10“^’^ 
(f  =  2.210“^®).  Standard  CGS  does  not  produce 
true  residuals  smaller  than  fa  10“®||ro||,  which  is  ap¬ 
proximately  f  ■  max||ri||  fa  2.210"^®  •  10^||ro||;  cf. 
(6.5g).  Observe  that,  though  the  convergence  his¬ 
tories  do  not  coincide  for  residuals  less  than  fa  10^, 
the  speed  of  convergence  is  not  affected:  the  modi¬ 
fied  versions  exhibit  a  rate  of  convergence  that  is  very 
similar  to  the  one  of  the  updated  residuals  in  stan¬ 
dard  CGS  as  shown  in  Figure  3. 

Experiments for other examples and with other iterative schemes, such as Bi-CGSTAB and BiCGstab($\ell$), led to similar conclusions. However, two observations should be made.

- Quite often the improvements are much more spectacular than for this SHERMAN4 example: CGS may produce intermediate residuals as large as $\|r_0\|/\xi$ and none of the digits in the final approximation of standard CGS will be correct.

- There are some differences between CGS and the BiCGstab methods: (i) As observed above, Neumaier's strategy only works well for CGS, while the simple strategy of Algorithm 2 can always be applied. (ii) Especially for the BiCGstab methods, the simple strategy of Algorithm 2 with update criteria (7.1d) and (7.1e) does not lead to much additional work. The additional computation of a true residual takes place after the process encounters residuals that are (much) larger than the initial residual. Since BiCGstab($\ell$) tends to show much smoother convergence behavior than CGS, for small $\ell$, the additional work in these methods is usually much less than for CGS. In the SHERMAN4 example, our strategy for CGS requires 7 additional matrix-vector multiplications ('compute_res' is true 7 times) and one special update of the approximation ('update_app' is true only once). For BiCGstab($\ell$), $\ell < 6$, only 1 additional matrix-vector multiplication was needed. Neumaier's strategy for CGS does not require additional matrix-vector multiplications (but 364 additional updates for the approximation were needed).

8  Termination  Criteria 

An  important  point,  when  using  iterative  processes, 
is  to  decide  when  to  terminate  the  process.  Popular 
stopping  criteria  are  based  on  the  norm  of  the  current 
residual,  or  on  the  norm  of  the  update  to  the  current 
approximation  to  the  solution  (or  a  combination  of 
these  norms).  More  sophisticated  criteria  have  been 
discussed  in  litterature. 

In [45] a practical termination criterion for the conjugate gradient
method is considered. Suppose we want an approximation $x_i$ for the
solution $x$ for which

$$ \|x_i - x\|_2 / \|x\|_2 \le \varepsilon, $$

where $\varepsilon$ is a tolerance set by the user.

It is shown in [45] that such an approximation is obtained by CG as soon
as

$$ \|r_i\|_2 \le \mu_1 \|x_i\|_2 \, \varepsilon / (1 + \varepsilon), $$

where $\mu_1$ stands for the smallest eigenvalue of the positive definite
symmetric (preconditioned) matrix $A$. Of course, in most applications the
value for $\mu_1$ will be unknown, but with the iteration coefficients of
CG we can build the tridiagonal matrix $T_i$, and compute the smallest
eigenvalue (Ritz value) $\mu_1^{(i)}$ of $T_i$, which is an approximation
for $\mu_1$. In [45] a simple algorithm for the computation of
$\mu_1^{(i)}$, along with the CG algorithm, is described, and it is shown
that a rather robust stopping criterion is formed by

$$ \|r_i\|_2 \le \mu_1^{(i)} \|x_i\|_2 \, \varepsilon / (1 + \varepsilon). $$

A similar criterion has also been suggested earlier in [40].
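
The criterion above can be sketched in a few lines of Python. This is only
an illustration under our own assumptions, not the algorithm of [45]: the
names `cg_ritz_stop`, `smallest_eig_tridiag` and `matvec_tridiag` are
hypothetical, the tridiagonal matrix $T_i$ is assembled from the CG
coefficients in the usual Lanczos form, and its smallest eigenvalue is
recomputed from scratch by bisection in every step, whereas [45] describes
a cheaper update.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec_tridiag(x, diag=4.0, off=-1.0):
    """y = A x for the SPD test matrix A = tridiag(off, diag, off)."""
    n = len(x)
    y = [diag * xk for xk in x]
    for k in range(n - 1):
        y[k] += off * x[k + 1]
        y[k + 1] += off * x[k]
    return y

def smallest_eig_tridiag(d, e):
    """Smallest eigenvalue of the symmetric tridiagonal matrix with
    diagonal d and off-diagonal e (bisection on a Sturm sequence count)."""
    m = len(d)

    def count_below(lam):
        # number of eigenvalues smaller than lam
        count, q = 0, 1.0
        for k in range(m):
            q = d[k] - lam - (e[k - 1] ** 2 / q if k > 0 else 0.0)
            if q == 0.0:
                q = 1e-300  # avoid a division by zero in the next step
            if q < 0.0:
                count += 1
        return count

    # Gershgorin interval that contains all eigenvalues
    radius = [(abs(e[k - 1]) if k > 0 else 0.0) +
              (abs(e[k]) if k < m - 1 else 0.0) for k in range(m)]
    lo = min(d[k] - radius[k] for k in range(m))
    hi = max(d[k] + radius[k] for k in range(m))
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if count_below(mid) >= 1:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def cg_ritz_stop(matvec, b, eps, max_iter=200):
    """Unpreconditioned CG, stopped as soon as
    ||r_i|| <= mu * ||x_i|| * eps / (1 + eps), mu = smallest Ritz value."""
    x = [0.0] * len(b)
    r = b[:]
    p = r[:]
    rho = dot(r, r)
    d, e = [], []                      # diagonal and off-diagonal of T_i
    alpha_old, beta_old, mu = 1.0, 0.0, 0.0
    for i in range(1, max_iter + 1):
        q = matvec(p)
        alpha = rho / dot(p, q)
        x = [xk + alpha * pk for xk, pk in zip(x, p)]
        r = [rk - alpha * qk for rk, qk in zip(r, q)]
        rho_new = dot(r, r)
        beta = rho_new / rho
        # next diagonal entry of T_i from the CG coefficients
        d.append(1.0 / alpha + (beta_old / alpha_old if i > 1 else 0.0))
        mu = smallest_eig_tridiag(d, e)
        if math.sqrt(rho_new) <= mu * math.sqrt(dot(x, x)) * eps / (1.0 + eps):
            return x, i, mu
        e.append(math.sqrt(beta) / alpha)   # next off-diagonal entry
        p = [rk + beta * pk for rk, pk in zip(r, p)]
        rho, alpha_old, beta_old = rho_new, alpha, beta
    return x, max_iter, mu
```

Note that the Ritz value approaches $\mu_1$ from above, so in early
iterations the test is somewhat optimistic; this is one reason to regard
the criterion as robust in practice rather than as a strict guarantee.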

A quite different, but much more generally applicable approach has been
suggested in [1]. In this approach the approximate solution of an
iterative process is regarded as the exact solution of some (nearby)
linear system, and computable bounds for the perturbations with respect
to the given system are given. A nice overview of termination criteria
has been presented in [6, Section 4.2].
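
One standard instance of this idea, given here as an illustration and not
necessarily in the exact form used in [1], is the normwise backward error.
For an iterate $x_i$ with residual $r_i = b - A x_i$, there exist
perturbations $\delta A$ and $\delta b$ with
$(A + \delta A) x_i = b + \delta b$, $\|\delta A\|_2 \le \eta \|A\|_2$
and $\|\delta b\|_2 \le \eta \|b\|_2$ if and only if

$$ \frac{\|r_i\|_2}{\|A\|_2 \, \|x_i\|_2 + \|b\|_2} \le \eta . $$

The left-hand side is computable during the iteration (given a value or an
estimate for $\|A\|_2$), so the process can be stopped as soon as it drops
below a tolerance $\eta$ that reflects the accuracy of the data.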

9  Implementation  Aspects 

For effective use of the given iteration schemes, it is necessary that
they can be implemented in such a way that high computing speeds are
achievable. It is most likely that high computing speeds will be realized
only by parallel architectures, and therefore we must see how well
iterative methods fit such computers.

The iterative methods only need a handful of basic operations per
iteration step:

• Vector updates: In each iteration step the current approximation to
the solution is updated by a correction vector. Often the corresponding
residual vector is also obtained by a simple update, and we have update
formulas as well for the correction vector (or search direction).

• Inner products: In many methods the speed of convergence is influenced
by carefully constructed iteration coefficients. These coefficients are
sometimes known analytically, but more often they are computed by inner
products, involving residual vectors and search directions, as in the
methods discussed in the previous sections.

• Matrix vector products: In each step at least one matrix vector product
has to be computed with the matrix of the given linear system. Sometimes
matrix vector products with the transpose of the given matrix are also
required (e.g., BiCG). Note that it is not necessary to have the matrix
explicitly; it suffices to be able to generate the result of the matrix
vector product.

• Preconditioning: It is common practice to precondition the given linear
system by some preconditioning operator. Again it is not necessary to
have this operator in explicit form; it is enough to generate the result
of the operator applied to some given vector. The preconditioner is
applied as often as the matrix vector multiply in each iteration step.
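
The four kernels listed above can be made visible in a minimal sketch of
preconditioned CG. The code is our own illustration: the function names
(`pcg`, `laplace_1d`, `diagonal_scaling`) are hypothetical, and the matrix
and the preconditioner are passed as functions to stress that neither has
to be available in explicit form.

```python
import math

def dot(u, v):
    # inner product kernel
    return sum(a * b for a, b in zip(u, v))

def axpy(a, x, y):
    # vector update kernel: returns a*x + y
    return [a * xk + yk for xk, yk in zip(x, y)]

def pcg(matvec, precond, b, tol=1e-10, max_iter=500):
    """Preconditioned CG for a symmetric positive definite operator.
    matvec(v) returns A v, precond(v) returns K^{-1} v."""
    x = [0.0] * len(b)
    r = b[:]
    z = precond(r)                     # preconditioning
    p = z[:]
    rho = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    for it in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        q = matvec(p)                  # matrix vector product
        alpha = rho / dot(p, q)        # inner product
        x = axpy(alpha, p, x)          # vector updates
        r = axpy(-alpha, q, r)
        z = precond(r)                 # preconditioning
        rho_new = dot(r, z)            # inner product
        p = axpy(rho_new / rho, p, z)  # vector update
        rho = rho_new
    return x, max_iter

def laplace_1d(v):
    """Matrix-free product with tridiag(-1, 2, -1)."""
    n = len(v)
    return [2.0 * v[k]
            - (v[k - 1] if k > 0 else 0.0)
            - (v[k + 1] if k < n - 1 else 0.0) for k in range(n)]

def diagonal_scaling(r):
    """Diagonal scaling preconditioner for laplace_1d (diagonal is 2)."""
    return [rk / 2.0 for rk in r]
```

A call such as `pcg(laplace_1d, diagonal_scaling, b)` then solves the
model problem without ever forming a matrix.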

For problem sizes large enough the inner products, vector updates and
matrix vector products are easily parallelized and vectorized. The more
successful preconditioners, i.e., those based upon incomplete LU
decomposition, are not easily parallelizable. For that reason one is
often satisfied with the use of only diagonal scaling as a preconditioner
on highly parallel computers, such as the CM-2 [7].

On distributed memory computers we need large-grained parallelism in
order to reduce synchronization overhead. This can be achieved by
combining the work required for a number of successive iteration steps.
The idea is first to construct in parallel a straightforward Krylov basis
for the search subspace in which an update for the current solution will
be determined. Once this basis has been computed, the vectors are
orthogonalized, as is done in Krylov subspace methods. The construction
as well as the orthogonalization can be done with large-grained
parallelism, and has a sufficient degree of parallelism in it. This
approach has been suggested for CG in [11] and for GMRES in [12], [5] and
[18]. One of the disadvantages of this approach is that a straightforward
basis, of the form $y, Ay, A^2 y, \ldots, A^i y$, is usually very
ill-conditioned. This is in sharp contrast to the optimal condition of
the orthogonal basis set constructed by most of the projection type
methods, and it puts severe limits on the number of steps that can be
combined. However, in [5] and [18] ways to improve the condition of a
basis generated in parallel are suggested, and it seems possible to take
larger numbers of steps, say 25, together. In [18] the effects of this
approach on the communication overhead are studied and compared with
experiments done on moderately massively parallel transputer systems.
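
The ill-conditioning of the straightforward basis is easy to observe
numerically. The following sketch (our own illustration, with the
hypothetical names `krylov_remainders` and `diag_matvec`) builds the
rescaled vectors $A^k y$ and measures, for each new vector, the norm of
the part that is orthogonal to all previous ones. For a well-conditioned
basis this number stays close to 1; for the straightforward basis it
decays quickly, because the vectors $A^k y$ become nearly parallel.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    nrm = math.sqrt(dot(v, v))
    return [a / nrm for a in v]

def krylov_remainders(matvec, y, steps):
    """For k = 0..steps, return the norm of the component of the
    normalized vector A^k y that is orthogonal to A^0 y, ..., A^(k-1) y."""
    v = normalize(y)
    ortho = []          # orthonormal basis of the span built so far
    remainders = []
    for _ in range(steps + 1):
        w = v[:]
        for q in ortho:                    # modified Gram-Schmidt
            c = dot(q, w)
            w = [wk - c * qk for wk, qk in zip(w, q)]
        rem = math.sqrt(dot(w, w))
        remainders.append(rem)
        ortho.append([wk / rem for wk in w])
        v = normalize(matvec(v))           # next basis vector, rescaled
    return remainders

def diag_matvec(v):
    """Test operator A = diag(1, 2, ..., n)."""
    return [(k + 1) * vk for k, vk in enumerate(v)]

if __name__ == "__main__":
    for k, rem in enumerate(krylov_remainders(diag_matvec, [1.0] * 20, 8)):
        print(k, rem)
```

The test operator is deliberately simple; the same experiment can be
repeated with any routine that returns the product $Av$.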

9.1 Parallelism in the Preconditioner

In this section we consider a number of possibilities for obtaining
parallelism in the standard Incomplete Choleski preconditioner [51]. The
linear systems are supposed to arise from standard finite difference
discretisations of second order PDEs over rectangular grids in two or
three dimensional space.

9.1.1  Overlapping  Local  Preconditioners 

Radicati di Brozolo and Robert [66] suggest partitioning the given matrix
$A$ in (slightly) overlapping blocks along the main diagonal. Note that a
given non-zero entry of $A$ is not necessarily contained in one of these
blocks. But experience suggests that this approach is more successful if
these blocks cover all the non-zero entries of $A$. The idea is to
compute in parallel local preconditioners for all of the blocks, e.g.,

(9.1a)    $A_n = L_n D_n^{-1} U_n - R_n$.

Then, when solving $Kw = r$ in the preconditioning step, we partition $r$
in (overlapping) parts $r_n$, according to $A_n$, and we solve the systems
$L_n D_n^{-1} U_n w_n = r_n$ in parallel. Finally we define the elements
of $w$ to be equal to the corresponding elements of the $w_n$'s in the
nonoverlapping parts and to the average of them in the overlapped parts.
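
The partition, local solve and averaging steps can be sketched as
follows for a tridiagonal model matrix. This is a simplified illustration
and not the code of [66]: the names are hypothetical, and each local
system is solved exactly by the Thomas algorithm, which here stands in
for the incomplete factorization $L_n D_n^{-1} U_n$ of the block.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system. lower[k] couples row k to k-1
    (lower[0] is unused), upper[k] couples row k to k+1."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for k in range(1, n):
        denom = diag[k] - lower[k] * c[k - 1]
        c[k] = upper[k] / denom
        d[k] = (rhs[k] - lower[k] * d[k - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for k in range(n - 2, -1, -1):
        x[k] = d[k] - c[k] * x[k + 1]
    return x

def overlapping_blocks(n, num_blocks, overlap):
    """Index ranges [start, stop) of diagonal blocks, each extended by
    `overlap` rows into its neighbours."""
    size = n // num_blocks
    ranges = []
    for b in range(num_blocks):
        start = max(0, b * size - overlap)
        stop = n if b == num_blocks - 1 else min(n, (b + 1) * size + overlap)
        ranges.append((start, stop))
    return ranges

def apply_overlapping_preconditioner(lower, diag, upper, r, ranges):
    """Return w: local solves per block, averaged where blocks overlap."""
    n = len(r)
    total = [0.0] * n
    count = [0] * n
    for start, stop in ranges:         # independent, could run in parallel
        w_loc = thomas_solve(lower[start:stop], diag[start:stop],
                             upper[start:stop], r[start:stop])
        for k, val in enumerate(w_loc):
            total[start + k] += val
            count[start + k] += 1
    return [t / c for t, c in zip(total, count)]
```

With a single block covering the whole matrix the routine reduces to an
exact solve; with several blocks the loop over `ranges` contains the work
that would be distributed over the processors.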

Radicati di Brozolo and Robert [66] report on timing results obtained on
an IBM 3090-600E/VF for GMRES preconditioned by overlapped incomplete LU
decomposition for a 2D system of order 32400 with a bandwidth of 360. For
$p$ processors ($1 \le p \le 6$) they subdivide $A$ in $p$ overlapping
parts, the overlap being so large that these blocks cover all the nonzero
entries of $A$. They found experimentally an overlap of about 360
elements to be optimal for their problem. This approach led to a speedup
of roughly $p$. In some cases a speedup even slightly larger than $p$ was
observed, apparently due to the fact that the parallel preconditioner was
slightly more effective than the standard one in those cases.

9.1.2  Repeated  Twisted  Factorization 

Meurant [54] reports on timing results obtained with a CRAY Y-MP/832,
using an incomplete repeated twisted block factorization for 2D problems.
In his experiments the L of the incomplete factorization has a block
structure, i.e., L has alternately a block below the diagonal, one above,
one below, and it ends with one above the diagonal. For this approach
Meurant reports a speedup, for preconditioned CG, close to 6 on the
8-processor CRAY Y-MP. This speedup has been measured relative to the
same repeated twisted factorization process executed on one single
processor. Meurant also reports an increase in the number of iteration
steps, due to this repeated twisting. This implies that the effective
speedup with respect to the nonparallel code is only about 4.
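
The two speedup figures can be related by a short calculation. Write
$T_{\rm std}(1)$ for the time of the nonparallel code on one processor and
$T_{\rm tw}(p)$ for the repeated twisted variant on $p$ processors (this
notation is ours). Then

$$ \frac{T_{\rm std}(1)}{T_{\rm tw}(8)}
   = \frac{T_{\rm tw}(1)}{T_{\rm tw}(8)} \cdot
     \frac{T_{\rm std}(1)}{T_{\rm tw}(1)}
   \approx 6 \cdot \frac{T_{\rm std}(1)}{T_{\rm tw}(1)} \approx 4 , $$

so that $T_{\rm tw}(1) / T_{\rm std}(1) \approx 6/4 = 1.5$. Under the
simplifying assumption (ours, not stated in [54]) that one iteration step
costs about the same in both variants, this corresponds to roughly 50%
more iteration steps for the repeated twisted preconditioner.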

9.1.3 Twisted and Nested Twisted Factorization

For 3D problems we have used the blockwise twisted approach [23] in the
z-direction, i.e. the (x, y)-planes in the grid were treated in parallel
from bottom and top inwards. Over each plane we used the diagonalwise
ordering, in order to achieve high vector speeds on each processor.

On a dedicated CRAY X-MP/2 this led, for preconditioned CG, to a
reduction by a factor of close to 2 in wall clock time with respect to
the CPU time for the nonparallel code on one single processor. For the
microtasked code the wall clock time on the 2-processor system was
measured for a dedicated system, whereas for the nonparallel code the CPU
time was measured on a moderately loaded system. In some situations the
speedup was even slightly larger than 2, due to better convergence
properties of the twisted incomplete preconditioner.

The effects of these and other orderings on the convergence of
preconditioned methods and on the amount of parallelism have been studied
in [21].
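
The basic twisting idea, elimination from both ends of the system towards
the middle, can be illustrated on a scalar tridiagonal system. The sketch
below is our own simplified example (the name `twisted_tridiag_solve` is
hypothetical); it uses an exact factorization of a point tridiagonal
matrix, whereas the experiments reported above use incomplete block
factorizations in which each "row" is a whole grid plane.

```python
def twisted_tridiag_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system by twisted elimination.
    lower[k] couples row k to k-1, upper[k] couples row k to k+1."""
    n = len(diag)
    m = n // 2                       # row where the two sweeps meet
    d = diag[:]
    f = rhs[:]
    # sweep 1 (first processor): eliminate sub-diagonal, rows 1 .. m-1
    for k in range(1, m):
        mult = lower[k] / d[k - 1]
        d[k] -= mult * upper[k - 1]
        f[k] -= mult * f[k - 1]
    # sweep 2 (second processor): eliminate super-diagonal, rows n-2 .. m+1
    for k in range(n - 2, m, -1):
        mult = upper[k] / d[k + 1]
        d[k] -= mult * lower[k + 1]
        f[k] -= mult * f[k + 1]
    # meeting row: receives a contribution from both sweeps
    if m > 0:
        mult = lower[m] / d[m - 1]
        d[m] -= mult * upper[m - 1]
        f[m] -= mult * f[m - 1]
    if m < n - 1:
        mult = upper[m] / d[m + 1]
        d[m] -= mult * lower[m + 1]
        f[m] -= mult * f[m + 1]
    x = [0.0] * n
    x[m] = f[m] / d[m]
    # back substitution outwards, again in two independent parts
    for k in range(m - 1, -1, -1):
        x[k] = (f[k] - upper[k] * x[k + 1]) / d[k]
    for k in range(m + 1, n):
        x[k] = (f[k] - lower[k] * x[k - 1]) / d[k]
    return x
```

The two elimination loops touch disjoint rows, as do the two substitution
loops, so each pair can be executed by two processors at the same time;
only the meeting row requires synchronization.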

We can also apply the twisted incomplete factorization in a nested way
[83]. For 3D problems this can be exploited by also twisting the blocks
corresponding to (x, y)-planes in the y-direction. Over the resulting
blocks, corresponding to half (x, y)-planes, we may apply diagonal
ordering in order to fully vectorize the four parallel parts.

By this approach we have been able to reduce the wall clock time by a
factor of 3.3, for preconditioned CG, on the 4-processor CONVEX C-240. In
this case the total CPU time, used by all of the processors, is roughly
equal to the CPU time required for single processor execution [85].
Unlike in the experiments on the CRAY X-MP/2, as reported before, we have
relied on the autotasking capabilities of the Fortran compiler for the
C-240 for all of the code, except for the preconditioning part. Since
some statements in the code lead to rather short vector lengths, this may
partially explain why the factor 3.3 for the CONVEX C-240 stays well
behind the theoretically expected factor of about 3.9. Another reason
might be that we were not completely sure whether our testing machine was
executing constantly in stand-alone mode during our timing experiments.
Even the system itself needs some CPU time from time to time.

9.1.4  Hyperplane  Ordering 

For a CYBER 205 it has been reported how to obtain long vector lengths
for certain 3D situations ([23], [73]), and, of course, this approach can
also be followed in order to obtain parallelism. This has been done by
Berryman et al. [7] for parallelizing standard ICCG on a Connection
Machine CM-2. For a 4K processor machine they report a computational
speed of 52.6 Mflops for the (sparse) matrix vector product, while 13.1
Mflops has been realized for the preconditioner, using the hyperplane
approach.
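
A minimal sketch of the hyperplane idea follows, under the assumption
(ours) that the ordering meant here is the usual one for a 7-point
stencil: in the forward sweep of the preconditioner the unknown at grid
point (i, j, k) depends only on its neighbours (i-1, j, k), (i, j-1, k)
and (i, j, k-1), so all points with the same value of i + j + k can be
computed independently. The function names and the model recurrence with
a constant coefficient `c` are hypothetical.

```python
def hyperplanes(nx, ny, nz):
    """Group the grid points by the hyperplane index i + j + k."""
    planes = [[] for _ in range(nx + ny + nz - 2)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                planes[i + j + k].append((i, j, k))
    return planes

def forward_sweep(points, r, c):
    """Model forward substitution
        w[i,j,k] = r[i,j,k] + c * (w[i-1,j,k] + w[i,j-1,k] + w[i,j,k-1]),
    visiting the grid points in the given order. A KeyError is raised if
    a point is visited before one of its predecessors."""
    w = {}

    def value(p):
        # points outside the grid contribute nothing
        return w[p] if min(p) >= 0 else 0.0

    for (i, j, k) in points:
        w[(i, j, k)] = r[(i, j, k)] + c * (
            value((i - 1, j, k)) + value((i, j - 1, k)) + value((i, j, k - 1)))
    return w

if __name__ == "__main__":
    nx, ny, nz = 4, 3, 2
    planes = hyperplanes(nx, ny, nz)
    for h, plane in enumerate(planes):
        # every point in `plane` could be handled by a different processor
        print("hyperplane", h, "has", len(plane), "independent points")
```

Visiting the points hyperplane by hyperplane gives the same result as the
natural ordering, but the work inside each hyperplane has no internal
dependencies. The first and last hyperplanes contain only a few points,
which is one reason why this approach needs large grids to be efficient.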

This reduction in speed by a factor of 4 makes it attractive to use only
diagonal scaling as a preconditioner in some situations, for massively
parallel machines like the CM-2. The latter approach has been followed by
Mathur and Johnsson [48] for finite element problems.

We have used the hyperplane ordering for preconditioned CG on an ALLIANT
FX/4, for 3D systems with dimensions $n_x = 40$, $n_y = 39$ and
$n_z = 30$. For 4 processors this led to a speedup of 2.61, to be
compared with a speedup of 2.54 for the CG process with only diagonal
scaling as a preconditioner. The fact that both speedups are quite far
below the optimal value of 4 must be attributed to cache effects [85].
These cache effects can be largely removed when using the reduced system
approach suggested by Meier and Sameh [49]. However, for the 3D systems
that we have tested so far, the reduced system approach led, on average,
to about the same CPU times as the hyperplane approach, on Alliant FX/8
and FX/80 computers.


References 

[1] M. Arioli, I. S. Duff, and D. Ruiz. Stopping criteria for iterative
solvers. SIAM J. Matrix Anal. Appl., 13:138-144, 1992.

[2] O. Axelsson. Solution of linear systems of equations: iterative
methods. In V. A. Barker, editor, Sparse Matrix Techniques, Berlin, 1977.
Copenhagen 1976, Springer Verlag.

[3] O. Axelsson. Conjugate gradient type methods for unsymmetric and
inconsistent systems of equations. Lin. Alg. and its Appl., 29:1-16,
1980.

[4] O. Axelsson and P. S. Vassilevski. A black box generalized conjugate
gradient solver with inner iterations and variable-step preconditioning.
SIAM J. Matrix Anal. Appl., 12(4):625-644, 1991.

[5] Zhaojun Bai, Dan Hu, and Lothar Reichel. A Newton basis GMRES
implementation. Technical Report 91-03, University of Kentucky, 1991.

[6] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra,
V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the
Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM,
Philadelphia, PA, 1994.

[7] H. Berryman, J. Saltz, W. Gropp, and R. Mirchandaney. Krylov methods
preconditioned with incompletely factored matrices on the CM-2. Technical
Report 89-54, NASA Langley Research Center, ICASE, Hampton, VA, 1989.

[8] A. Björck and T. Elfving. Accelerated projection methods for
computing pseudo-inverse solutions of systems of linear equations. BIT,
19:145-163, 1979.

[9]  P.  N.  Brown.  A  theoretical  comparison  of  the 
Arnoldi  and  GMRES  algorithms.  SIAM  J.  Sci. 
Statist.  Comput.,  12:58-78,  1991. 

[10] G. Brussino and V. Sonnad. A comparison of direct and preconditioned
iterative techniques for sparse unsymmetric systems of linear equations.
Int. J. for Num. Methods in Eng., 28:801-815, 1989.

[11] A. T. Chronopoulos and C. W. Gear. s-Step iterative methods for
symmetric linear systems. J. on Comp. and Appl. Math., 25:153-168, 1989.

[12] A. T. Chronopoulos and S. K. Kim. s-Step Orthomin and GMRES
implemented on parallel computers. Technical Report 90/43R, UMSI,
Minneapolis, 1990.

[13] P. Concus and G. H. Golub. A generalized Conjugate Gradient method
for nonsymmetric systems of linear equations. Technical Report
STAN-CS-76-535, Stanford University, Stanford, CA, 1976.

[14] P. Concus, G. H. Golub, and D. P. O’Leary. A generalized conjugate
gradient method for the numerical solution of elliptic partial
differential equations. In J. R. Bunch and D. J. Rose, editors, Sparse
Matrix Computations. Academic Press, New York, 1976.




[15]  G.  C.  (Lianne)  Crone.  The  conjugate  gradient 
method  on  the  Parsytec  GCel-3/512.  to  appear 
in  FGCS. 

[16] L. Crone and H. van der Vorst. Communication aspects of the
conjugate gradient method on distributed-memory machines. Supercomputer,
X(6):4-9, 1993.

[17] E. de Sturler. A parallel restructured version of GMRES(m).
Technical Report 91-85, Delft University of Technology, Delft, 1991.

[18] E. de Sturler. A parallel variant of GMRES(m). In R. Miller, editor,
Proc. of the fifth Int. Symp. on Numer. Methods in Eng., 1991.

[19] E. De Sturler and D. R. Fokkema. Nested Krylov methods and
preserving the orthogonality. In N. Duane Melson, T. A. Manteuffel, and
S. F. McCormick, editors, Sixth Copper Mountain Conference on Multigrid
Methods, volume Part 1 of NASA Conference Publication 3324, pages
111-126. NASA, 1993.

[20]  J.  Demmel,  M.  Heath,  and  H.  van  der  Vorst. 
Parallel  linear  algebra.  In  Acta  Numerica  1993. 
Cambridge  University  Press,  Cambridge,  1993. 

[21] Shun Doi. On parallelism and convergence of incomplete LU
factorizations. Appl. Num. Math., 7:417-436, 1991.

[22] J. J. Dongarra. Performance of various computers using standard
linear equations software in a Fortran environment. Technical Report
CS-89-85, University of Tennessee, Knoxville, 1990.

[23] J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst.
Solving Linear Systems on Vector and Shared Memory Computers. SIAM,
Philadelphia, PA, 1991.

[24] Jack J. Dongarra and Henk A. van der Vorst. Performance of various
computers using standard sparse linear equations solving techniques.
Supercomputer, 9(5):17-29, 1992.

[25] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct methods for sparse
matrices. Oxford University Press, London, 1986.

[26] T. Eirola and O. Nevanlinna. Accelerating with rank-one updates.
Lin. Alg. and its Appl., 121:511-520, 1989.

[27]  H.  C.  Elman.  Iterative  methods  for  large  sparse 
nonsymmetric  systems  of  linear  equations.  PhD 
thesis,  Yale  University,  New  Haven,  CT,  1982. 


[28] R. Fletcher. Conjugate gradient methods for indefinite systems,
volume 506 of Lecture Notes Math., pages 73-89. Springer-Verlag,
Berlin-Heidelberg-New York, 1976.

[29]  D.  R.  Fokkema,  G.L.G.  Sleijpen  and  H.A. 
Van  der  Vorst.  Generalized  Conjugate  Gradient 
Squared.  Preprint  851,  Dept.  Math.,  University 
Utrecht,  1994. 

[30]  R.  W.  Freund,  M.  H.  Gutknecht,  and  N.  M. 
Nachtigal.  An  implementation  of  the  look-ahead 
Lanczos  algorithm  for  non-Hermitian  matrices. 
SIAM  J.  Sci.  Comput.,  14:137-158,  1993. 

[31] R. W. Freund and N. M. Nachtigal. An implementation of the
look-ahead Lanczos algorithm for non-Hermitian matrices, part 2.
Technical Report 90.46, RIACS, NASA Ames Research Center, 1990.

[32] R. W. Freund and N. M. Nachtigal. QMR: a quasi-minimal residual
method for non-Hermitian linear systems. Num. Math., 60:315-339, 1991.

[33] R. W. Freund. A transpose-free quasi-minimal residual algorithm for
non-Hermitian linear systems. SIAM J. Sci. Comput., 14:470-482, 1993.

[34] G. H. Golub and D. P. O’Leary. Some history of the conjugate
gradient and Lanczos algorithms: 1948-1976. SIAM Review, 31:50-102, 1989.

[35] G. H. Golub and C. F. van Loan. Matrix Computations. North Oxford
Academic, Oxford, 1983.

[36] G. H. Golub and C. F. van Loan. Matrix Computations. The Johns
Hopkins University Press, Baltimore, 1989.

[37] I. Gustafsson. A class of first order factorization methods. BIT,
18:142-156, 1978.

[38] M. H. Gutknecht. Variants of BICGSTAB for matrices with complex
spectrum. SIAM J. Sci. Comput., 14:1020-1033, 1993.

[39] W. Hackbusch. Iterative Lösung großer schwachbesetzter
Gleichungssysteme. Teubner, Stuttgart, 1991.

[40] L. A. Hageman and D. M. Young. Applied Iterative Methods. Academic
Press, New York, 1981.

[41] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for
solving linear systems. J. Res. Natl. Bur. Stand., 49:409-436, 1952.

[42] C. P. Jackson and P. C. Robinson. A numerical study of various
algorithms related to the preconditioned Conjugate Gradient method.
Int. J. for Num. Meth. in Eng., 21:1315-1338, 1985.




[43] D. A. H. Jacobs. Preconditioned Conjugate Gradient methods for
solving systems of algebraic equations. Technical Report RD/L/N 193/80,
Central Electricity Research Laboratories, 1981.

[44] K. C. Jea and D. M. Young. Generalized conjugate-gradient
acceleration of nonsymmetrizable iterative methods. Lin. Algebra Appl.,
34:159-194, 1980.

[45] E. F. Kaasschieter. A practical termination criterion for the
Conjugate Gradient method. BIT, 28:308-322, 1988.

[46] C. Lanczos. An iteration method for the solution of the eigenvalue
problem of linear differential and integral operators. J. Res. Natl. Bur.
Stand., 45:225-280, 1950.

[47] C. Lanczos. Solution of systems of linear equations by minimized
iterations. J. Res. Natl. Bur. Stand., 49:33-53, 1952.

[48] K. K. Mathur and S. L. Johnsson. The finite element method on a data
parallel computing system. Technical Report CS 89-2, Thinking Machines
Corporation, 1989. To appear in International Journal of High-Speed
Computing.

[49] U. Meier and A. Sameh. The behavior of conjugate gradient algorithms
on a multivector processor with a hierarchical memory. Technical Report
CSRD 758, University of Illinois, Urbana, IL, 1988.

[50] U. Meier Yang. Preconditioned Conjugate Gradient-Like Methods for
Nonsymmetric Linear Systems. Preprint, Center for Research and
Development, University of Illinois at Urbana-Champaign, 1992.

[51] J. A. Meijerink and H. A. van der Vorst. An iterative solution
method for linear systems of which the coefficient matrix is a symmetric
M-matrix. Math. Comp., 31:148-162, 1977.

[52] G. Meurant. The block preconditioned conjugate gradient method on
vector computers. BIT, 24:623-633, 1984.

[53] G. Meurant. Numerical experiments for the preconditioned conjugate
gradient method on the CRAY X-MP/2. Technical Report LBL-18023,
University of California, Berkeley, CA, 1984.

[54]  G.  Meurant.  The  conjugate  gradient  method  on 
vector  and  parallel  supercomputers.  Technical 
Report  CTAC-89,  University  of  Brisbane,  July 
1989. 


[55] N. M. Nachtigal, S. C. Reddy, and L. N. Trefethen. How fast are
nonsymmetric matrix iterations? SIAM J. Matrix Anal. Appl., 13:778-795,
1992.

[56] A. Neumaier. Oral presentation at the Oberwolfach meeting: Numerical
Linear Algebra, Oberwolfach, 1994.

[57] J. M. Ortega. Introduction to Parallel and Vector Solution of Linear
Systems. Plenum Press, New York and London, 1988.

[58] C. C. Paige. Computational variants of the Lanczos method for the
eigenproblem. J. Inst. Math. Appl., 10:373-381, 1972.

[59] C. C. Paige, B. N. Parlett, and H. A. van der Vorst. Approximate
solutions and eigenvalue bounds from Krylov subspaces. Num. Lin. Alg.
with Appl., 2:115-134, 1995.

[60] C. C. Paige and M. A. Saunders. Solution of sparse indefinite
systems of linear equations. SIAM J. Numer. Anal., 12:617-629, 1975.

[61]  C.  C.  Paige  and  M.  A.  Saunders.  LSQR:  An 
algorithm  for  sparse  linear  equations  and  sparse 
least  squares.  ACM  Trans.  Math.  Soft.,  8:43-71, 
1982. 

[62]  B.  N.  Parlett,  D.  R.  Taylor,  and  Z.  A.  Liu.  A 
look-ahead  Lanczos  algorithm  for  unsymmetric 
matrices.  Math.  Comp.,  44:105-124,  1985. 

[63]  Beresford  N.  Parlett.  The  Symmetric  Eigenvalue 
Problem.  Prentice-Hall,  Englewood  Cliffs,  N.J., 
1980. 

[64] Claude Pommerell. Solution of large unsymmetric systems of linear
equations. PhD thesis, Swiss Federal Institute of Technology, Zurich,
1992.

[65] Claude Pommerell and Wolfgang Fichtner. PILS: An iterative linear
solver package for ill-conditioned systems. Technical Report 91/5, ETH
Zurich, 1991.

[66] G. Radicati di Brozolo and Y. Robert. Vector and parallel CG-like
algorithms for sparse nonsymmetric systems. Technical Report 681-M,
IMAG/TIM3, Grenoble, 1987.

[67] G. Radicati di Brozolo and Y. Robert. Parallel conjugate
gradient-like algorithms for solving sparse non-symmetric systems on a
vector multiprocessor. Parallel Computing, 11:223-239, 1989.




[68] Y. Saad. Practical use of polynomial preconditionings for the
conjugate gradient method. SIAM J. Sci. Stat. Comput., 6:865-881, 1985.

[69] Y. Saad. Krylov subspace methods on supercomputers. Technical
report, RIACS, Moffett Field, CA, September 1988.

[70] Y. Saad. A flexible inner-outer preconditioned GMRES algorithm.
SIAM J. Sci. Comput., 14:461-469, 1993.

[71] Y. Saad and M. H. Schultz. Conjugate Gradient-like algorithms for
solving nonsymmetric linear systems. Math. of Comp., 44:417-424, 1985.

[72] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual
algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist.
Comput., 7:856-869, 1986.

[73] J. J. E. M. Schlichting and H. A. van der Vorst. Solving 3D block
bidiagonal linear systems on vector computers. Journal of Comp. and Appl.
Math., 27:323-330, 1989.

[74] Horst D. Simon. Direct sparse matrix methods. In James C. Almond and
David M. Young, editors, Modern Numerical Algorithms for Supercomputers,
pages 325-444, Austin, 1989. The University of Texas at Austin, Center
for High Performance Computing.

[75] G. L. G. Sleijpen and H. A. Van der Vorst. Maintaining convergence
properties of BICGSTAB methods in finite precision arithmetic. Technical
report, University Utrecht, Department of Mathematics, 1994.

[76] G. L. G. Sleijpen and H. A. Van der Vorst. Reliable updated
residuals in hybrid Bi-CG methods. Preprint Nr. 886, Dept. Math.,
University Utrecht, 1994.

[77] G. L. G. Sleijpen, H. A. Van der Vorst, and D. R. Fokkema.
Bi-CGSTAB($\ell$) and other hybrid Bi-CG methods. Numerical Algorithms,
7:75-109, 1994.

[78] G. L. G. Sleijpen and D. R. Fokkema. BICGSTAB($\ell$) for linear
equations involving unsymmetric matrices with complex spectrum. ETNA,
1:11-32, 1993.

[79]  P.  Sonneveld.  CGS:  a  fast  Lanczos-type  solver 
for  nonsymmetric  linear  systems.  SIAM  J.  Sci. 
Statist.  Comput.,  10:36-52,  1989. 

[80] A. van der Sluis and H. A. van der Vorst. The rate of convergence of
conjugate gradients. Numer. Math., 48:543-560, 1986.

[81] A. van der Sluis and H. A. van der Vorst. Numerical solution of
large sparse linear algebraic systems arising from tomographic problems.
In G. Nolet, editor, Seismic Tomography, chapter 3, pages 49-83. Reidel
Pub. Comp., Dordrecht, 1987.

[82] A. van der Sluis and H. A. van der Vorst. SIRT- and CG-type methods
for the iterative solution of sparse linear least-squares problems. Lin.
Alg. and Its Appl., 130:257-302, 1990.

[83] H. A. van der Vorst. The convergence behavior of some iterative
solution methods. In R. Gruber, J. Periaux, and R. P. Shaw, editors,
Proc. of the fifth Int. Symp. on Numer. Methods in Eng., 1989. vol 1.

[84] H. A. van der Vorst. The convergence behaviour of preconditioned CG
and CG-S in the presence of rounding errors. In O. Axelsson and L. Yu.
Kolotilina, editors, Preconditioned Conjugate Gradient Methods, Berlin,
1990. Nijmegen 1989, Springer Verlag. Lecture Notes in Mathematics 1457.

[85] H. A. van der Vorst. Experiences with parallel vector computers for
sparse linear systems. Supercomputer, 37:28-35, 1990.

[86]  H.  A.  van  der  Vorst.  Bi-CGSTAB:  A  fast  and 
smoothly  converging  variant  of  Bi-CG  for  the 
solution  of  non-symmetric  linear  systems.  SIAM 
J.  Sci.  Statist.  Comput.,  13:631-644,  1992. 

[87] H. A. van der Vorst. Conjugate gradient type methods for
nonsymmetric linear systems. In R. Beauwens and P. de Groen, editors,
Iterative Methods in Linear Algebra, Amsterdam, 1992. IMACS Int. Symp.,
Brussels, Belgium, 2-4 April, 1991, North-Holland.

[88] H. A. van der Vorst and C. Vuik. The superlinear convergence
behaviour of GMRES. JCAM, 48:327-341, 1993.

[89]  H.  A.  van  der  Vorst  and  C.  Vuik.  GMRESR:  A 
family  of  nested  GMRES  methods.  Num.  Lin. 
Alg.  with  Appl.,  1:369-386,  1994. 

[90] R. S. Varga. Matrix Iterative Analysis. Prentice-Hall, Englewood
Cliffs, N.J., 1962.

[91] P. K. W. Vinsome. ORTHOMIN: an iterative method for solving sparse
sets of simultaneous linear equations. In Proc. Fourth Symposium on
Reservoir Simulation, pages 149-159. Society of Petroleum Engineers of
AIME, 1976.

[92] H. F. Walker. Implementation of the GMRES method using Householder
transformations. SIAM J. Sci. Stat. Comp., 9:152-163, 1988.




[93] O. Widlund. A Lanczos method for a class of nonsymmetric systems of
linear equations. SIAM J. Numer. Anal., 15:801-812, 1978.

[94] J. H. Wilkinson. The Algebraic Eigenvalue Problem. Clarendon Press,
Oxford, 1965.

[95] L. Zhou and H. F. Walker. Residual smoothing techniques for
iterative methods. SIAM J. Sci. Comput., 15:297-312, 1994.


Structured Grid Solvers I

Accurate  and  Efficient  Flow  Solvers  for  3D  Applications  on  Structured  Meshes 

Norbert  Kroll,  Rolf  Radespiel,  Cord-C.  Rossow 




Institute  for  Design  Aerodynamics 
DLR,  Lilienthalplatz  7,  38108  Braunschweig,  Germany 


SUMMARY 

This lecture is devoted to the parallelization of blockstructured grid
solvers for industrial applications. It is divided into two parts. Part I
describes well established numerical algorithms with emphasis on spatial
discretization and time stepping schemes. Attention is focused on the
multigrid technique, which is one of the most promising approaches to
improve the efficiency of numerical methods. Finally, several large-scale
computations are shown which demonstrate the ability of current
blockstructured flow solvers.

Part II of the lecture addresses various aspects of the parallelization
of such flow solvers.

LIST OF SYMBOLS

$c$    speed of sound
$D$    dissipative operator
$e$    internal energy per unit mass
$E$    specific total energy
$\bar{\bar{F}}^c, \bar{\bar{F}}^v$    inviscid and viscous part of flux tensor $\bar{\bar{F}}$
$H$    specific total enthalpy
$K$    heat conductivity
$M$    Mach number
$n$    outward pointing normal
$p$    pressure
$Pr$    Prandtl number
$q$    vector of cartesian velocities
$R$    discrete flux balance
$Re$    Reynolds number
$S$    surface vector
$t$    time
$u, v, w$    cartesian velocity components
$V$    control volume
$\vec{W}$    vector of conserved variables
$\alpha$    angle of attack
$\gamma$    ratio of specific heats
$\lambda$    spectral radius of flux Jacobian
$\mu$    nondimensional viscosity
$\rho$    density

Indices

$l$    laminar
$t$    turbulent
$\infty$    free stream
$i, j, k$    indices of grid node

1.  INTRODUCTION 

Numerical flow simulations have found their way into the
aerodynamic design cycles of aerospace vehicles. Not only
do these simulations reduce turn-around time and cost, but
they also offer flow parameter variations which are not
possible with wind tunnel testing. On the other hand, numerical
simulations in aerodynamics are still an engineering
challenge. The governing partial differential equations
do not always represent a well-posed problem, that is,
uniqueness and existence of a solution are usually not
proven and it is difficult to formulate suitable initial and
boundary conditions. Moreover, the existence of turbulence
in the majority of relevant flow problems makes the
direct solution of the governing unsteady Navier-Stokes
equations impossible because the relevant scales vary too
much. The problem may be circumvented by averaging the
turbulent motion. This yields the Reynolds-averaged
Navier-Stokes equations, which can be solved if a turbulence
model is provided for closure. Suitable turbulence models
have been under investigation over the last 70 years, and
the matter is still not solved to satisfaction. However, in the
present lecture we will assume that the effect of turbulence
can be described by adding a turbulent viscosity and heat
conductivity to their laminar counterparts. Even then, flows
over aerodynamic configurations display flow phenomena
with very different scales and with highly nonlinear behavior.
We mention here the laminar and turbulent boundary
layers at very high Reynolds numbers and their interaction
with shocks as an example. Numerical simulations of such
flow problems often converge slowly because the discretized
mathematical model is stiff.

The present lecture describes well established techniques
used for numerical simulations of complex aerodynamic
flows based on blockstructured meshes. We restrict ourselves
to problems with steady mean flow, that is, we want
to obtain steady-state solutions of the Euler equations
governing inviscid flows and of the Reynolds-averaged
Navier-Stokes equations for viscous flows. In this paper
attention is focused on the general description of the two major
parts of the numerical method. These are the spatial
discretization and the time stepping algorithms. The parallelization
issues of blockstructured flow solvers for industrial
applications are treated in the second lecture [1].

With the spatial discretization of the governing equations
we seek to obtain accurate solutions with as few discrete
points in the flow domain as possible. Care must be taken to
resolve all relevant flow phenomena, i.e. smoothly varying
regions of inviscid flows, flow discontinuities such as shocks
and slip lines, and viscous layers which are governed by
diffusion. Moreover, numerical analysis and practical
experience show that the choice of the spatial discretization
also influences the convergence of the overall method to
the desired steady state.

Possibilities to improve convergence to steady-state solutions
by improving or adding numerical techniques have
attracted the efforts of many researchers over the last 20
years. We will concentrate on one of the most promising
approaches, which is called multigrid. The present state of
the art in the use of multigrid for the solution of the hyperbolic
flow equations with time-stepping schemes is described
in detail, analyzed, and demonstrated with a variety
of sample calculations.

Finally, we present several large-scale computations which
demonstrate the usefulness of the efforts to improve accuracy
and convergence of current flow solvers.


2.  GOVERNING  EQUATIONS 


The most general description of the fluid flow is obtained
from the time dependent compressible Navier-Stokes equations,
which express the conservation laws for mass, momentum
and energy for viscous fluids. For turbulent flows
the so-called Reynolds-averaged Navier-Stokes equations
are used. They are derived from the Navier-Stokes
equations by introducing a time-averaging procedure. The
laws of motion are then expressed for the mean, time-averaged
turbulent quantities. By this means the equations for
turbulent flows take the same form as the equations for
laminar flow.


The integral form of the three-dimensional Reynolds-averaged
Navier-Stokes equations using nondimensional variables
in a cartesian coordinate system can be written as

$$\frac{\partial}{\partial t}\int_V \vec W\,dV + \int_{\partial V} \bar{\bar F}\cdot\vec n\,dS = 0 \qquad (2.1)$$

where

$$\vec W = [\rho,\ \rho u,\ \rho v,\ \rho w,\ \rho E]^T$$

is the vector of conserved quantities with $\rho$, $u$, $v$, $w$ and $E$
denoting the density, cartesian velocity components and
specific total energy, respectively. $V$ denotes an arbitrary
control volume fixed in time and space with boundary $\partial V$
and the outer normal $\vec n$. The total enthalpy is given by

$$H = E + p/\rho \qquad (2.2)$$

The flux tensor $\bar{\bar F}$ may be divided into its inviscid (convective)
part $\bar{\bar F}^c$ and its viscous part $\bar{\bar F}^v$ as

$$\bar{\bar F} = \bar{\bar F}^c - \bar{\bar F}^v \qquad (2.3)$$

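with

$$\bar{\bar F}^c = \begin{bmatrix}\rho u\\ \rho u^2 + p\\ \rho u v\\ \rho u w\\ \rho u H\end{bmatrix}\vec k_x + \begin{bmatrix}\rho v\\ \rho u v\\ \rho v^2 + p\\ \rho v w\\ \rho v H\end{bmatrix}\vec k_y + \begin{bmatrix}\rho w\\ \rho u w\\ \rho v w\\ \rho w^2 + p\\ \rho w H\end{bmatrix}\vec k_z$$

$$\bar{\bar F}^v = \begin{bmatrix}0\\ \sigma_{xx}\\ \sigma_{xy}\\ \sigma_{xz}\\ u\sigma_{xx} + v\sigma_{xy} + w\sigma_{xz} - q_x\end{bmatrix}\vec k_x + \begin{bmatrix}0\\ \sigma_{yx}\\ \sigma_{yy}\\ \sigma_{yz}\\ u\sigma_{yx} + v\sigma_{yy} + w\sigma_{yz} - q_y\end{bmatrix}\vec k_y + \begin{bmatrix}0\\ \sigma_{zx}\\ \sigma_{zy}\\ \sigma_{zz}\\ u\sigma_{zx} + v\sigma_{zy} + w\sigma_{zz} - q_z\end{bmatrix}\vec k_z$$

where $\vec k_x$, $\vec k_y$, $\vec k_z$ denote the cartesian coordinate directions.
Assuming that air behaves as a calorically perfect gas,
the pressure is calculated by the equation of state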


$$p = (\gamma - 1)\,\rho\left(E - \frac{u^2 + v^2 + w^2}{2}\right) \qquad (2.4)$$

where $\gamma$ denotes the ratio of specific heats. The temperature
$T$ is given by

$$T = p/\rho . \qquad (2.5)$$

The elements of the shear-stress tensor and the heat-flux
vector are given by the equations for a Newtonian fluid

$$\begin{aligned}
\sigma_{xx} &= 2\mu u_x - \tfrac{2}{3}\mu\,(u_x + v_y + w_z)\\
\sigma_{yy} &= 2\mu v_y - \tfrac{2}{3}\mu\,(u_x + v_y + w_z)\\
\sigma_{zz} &= 2\mu w_z - \tfrac{2}{3}\mu\,(u_x + v_y + w_z)\\
\sigma_{xy} &= \sigma_{yx} = \mu\,(u_y + v_x)\\
\sigma_{xz} &= \sigma_{zx} = \mu\,(u_z + w_x)\\
\sigma_{yz} &= \sigma_{zy} = \mu\,(v_z + w_y)\\
q_x &= -K\frac{\partial T}{\partial x},\quad q_y = -K\frac{\partial T}{\partial y},\quad q_z = -K\frac{\partial T}{\partial z}
\end{aligned} \qquad (2.6)$$

For laminar flow the nondimensional viscosity $\mu$ is assumed
to follow the Sutherland law

$$\mu = \gamma^{1/2}\,\frac{M_\infty}{Re_\infty}\left(\frac{\tilde T}{\tilde T_\infty}\right)^{3/2}\frac{\tilde T_\infty + 110\,\mathrm{K}}{\tilde T + 110\,\mathrm{K}} \qquad (2.7)$$

with $M_\infty$, $Re_\infty$ and $\tilde T$ denoting the free stream Mach number,
Reynolds number and the dimensional temperature,
respectively. The heat conductivity $K$ is given by

$$K = \frac{\gamma}{\gamma - 1}\,\frac{\mu}{Pr} \qquad (2.8)$$

with $Pr$ being the Prandtl number.

For turbulent flows, the laminar viscosity $\mu$ in eq. (2.7) is
replaced by $\mu + \mu_t$, and $\mu/Pr$ in eq. (2.8) is replaced by
$\mu/Pr + \mu_t/Pr_t$, where the eddy viscosity $\mu_t$ and the
turbulent Prandtl number $Pr_t$ are provided by a turbulence
model. For the transonic airfoil calculations presented in
this paper the algebraic turbulence model of Baldwin/Lomax
[2] is used.

For hypersonic flow calculations it is assumed that air
behaves as reacting air in thermochemical equilibrium. In this
case a modified ratio of specific heats is used. Furthermore,
the speed of sound is given by

$$c^2 = \left.\frac{\partial p}{\partial \rho}\right|_{e=\mathrm{const}} + \frac{p}{\rho^2}\left.\frac{\partial p}{\partial e}\right|_{\rho=\mathrm{const}} \qquad (2.9)$$

where $e$ is the internal energy per unit mass. For the calculation
of the effective ratio of specific heats and for the partial
derivatives of pressure in eq. (2.9), piecewise analytically
defined functions [3] are used. These functions relate
the pressure to both the density and the specific internal
energy and take into account excitation of vibration and
dissociation of O2 and N2 molecules. The temperature, viscosity
and heat conductivity are similarly computed.
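
As an illustration of the perfect-gas closure relations of this section, the following minimal Python sketch evaluates the pressure of eq. (2.4), the temperature of eq. (2.5), the total enthalpy of eq. (2.2) and the heat conductivity of eq. (2.8). The Sutherland law is written in a commonly used nondimensional form with a constant of 110 K; this form, the values `GAMMA = 1.4` and `PRANDTL = 0.72`, and all function names are assumptions made for the example and are not taken from the lecture.

```python
import math

GAMMA = 1.4            # ratio of specific heats for air (assumed value)
PRANDTL = 0.72         # laminar Prandtl number (assumed value)
SUTHERLAND_K = 110.0   # Sutherland constant in kelvin (assumed form of the law)


def pressure(rho, u, v, w, E, gamma=GAMMA):
    """Eq. (2.4): p = (gamma - 1) * rho * (E - (u^2 + v^2 + w^2) / 2)."""
    return (gamma - 1.0) * rho * (E - 0.5 * (u * u + v * v + w * w))


def temperature(p, rho):
    """Eq. (2.5): nondimensional temperature T = p / rho."""
    return p / rho


def total_enthalpy(E, p, rho):
    """Eq. (2.2): H = E + p / rho."""
    return E + p / rho


def sutherland_viscosity(T_dim, T_inf_dim, mach_inf, re_inf, gamma=GAMMA):
    """Nondimensional laminar viscosity; both temperatures are in kelvin."""
    ratio = T_dim / T_inf_dim
    return (math.sqrt(gamma) * mach_inf / re_inf * ratio ** 1.5
            * (T_inf_dim + SUTHERLAND_K) / (T_dim + SUTHERLAND_K))


def heat_conductivity(mu, gamma=GAMMA, prandtl=PRANDTL):
    """Eq. (2.8): K = gamma / (gamma - 1) * mu / Pr."""
    return gamma / (gamma - 1.0) * mu / prandtl
```

For a turbulent computation the same functions could be reused by passing `mu + mu_t` to the stress evaluation and adding `mu_t / Pr_t` to the conductivity term, as described in the text.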


3. SPATIAL DISCRETIZATION SCHEME

The derivation of the conservation laws in integral form
only requires the assumption that the density is twice
continuously differentiable with respect to time. Therefore, in
contrast to the differential form, the integral form of the
governing equations does not impose any assumptions on
the regularity of the solution. This is extremely important
since discontinuities such as shock waves and slip lines
occur in most of the relevant flow fields.

The discretization of the integral form of the conservation
laws leads to finite element or finite volume methods. This
paper focuses on the discussion of the finite volume
approximation based on structured computational meshes.


In finite volume methods the flow field is subdivided into a
set of non-overlapping cells which cover the whole domain
without gaps. The conservation laws in integral form are
applied on each cell, which ensures the conservation of
mass, momentum and energy also in the discrete formulation. In
general, the control volumes can have arbitrary shapes.
With respect to computational efficiency, however, structured
hexahedral cells are very often used for 3D calculations.
For practical applications the control volumes are
provided by a body-fitted mesh generated by grid generation
packages using curvilinear coordinates (see Fig. 1).
The only required data concerning the grid are the cartesian
coordinates of the vertices. Hence, no global transformation
of the governing equations into the curvilinear coordinate
system is necessary.
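
The statement that only the cartesian vertex coordinates are needed can be illustrated with a small Python sketch. It computes the surface vector of a quadrilateral cell face as half the cross product of its diagonals and the volume of a hexahedral cell from the divergence theorem. These are common finite volume formulas chosen here for illustration (the volume formula is exact for planar faces); the function names, the local vertex numbering and the assumption of a right-handed (i, j, k) ordering are hypothetical and not taken from the lecture.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def face_vector(a, b, c, d):
    """Surface vector of the quadrilateral a-b-c-d (half the cross product
    of the diagonals). It points towards the side from which the vertices
    appear in counterclockwise order."""
    n = _cross(_sub(c, a), _sub(d, b))
    return (0.5 * n[0], 0.5 * n[1], 0.5 * n[2])


# Local vertex indices (ni, nj, nk) of the six faces, ordered such that
# each face vector points out of the cell for a right-handed numbering.
OUTWARD_FACES = (
    ((1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1)),  # +i face
    ((0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0)),  # -i face
    ((0, 1, 0), (0, 1, 1), (1, 1, 1), (1, 1, 0)),  # +j face
    ((0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1)),  # -j face
    ((0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)),  # +k face
    ((0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0)),  # -k face
)


def cell_volume(vertices):
    """Volume from the divergence theorem: V = (1/3) * sum(r_face . S_face).

    `vertices` maps the local index (ni, nj, nk) to cartesian coordinates."""
    total = 0.0
    for face in OUTWARD_FACES:
        pts = [vertices[n] for n in face]
        s = face_vector(*pts)
        centroid = tuple(sum(p[m] for p in pts) / 4.0 for m in range(3))
        total += _dot(centroid, s)
    return total / 3.0


def box(lx, ly, lz):
    """Helper for the example: vertices of an axis-aligned box."""
    return {(i, j, k): (i * lx, j * ly, k * lz)
            for i in (0, 1) for j in (0, 1) for k in (0, 1)}
```

For example, `cell_volume(box(2.0, 3.0, 4.0))` evaluates the volume of a 2 x 3 x 4 box directly from its eight vertices.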

Through the application of the integral form of the
Navier-Stokes equations a discrete flux balance is obtained for
each control volume, which can be used to approximately
determine the change of flow quantities with respect to
time at particular points. Various finite volume formulations
are known in the literature. They differ in the arrangement
of control volumes and update points for the flow
variables. The most frequently used schemes are the
cell-centered, the cell-vertex and the node-centered approach.
They are sketched in Fig. 2. For the node-centered and
cell-vertex scheme the flow variables are associated with
the cell vertices, whereas for the cell-centered scheme they
are located at the center of the cell. Each of these schemes
has advantages and disadvantages. For example, using a
central discretization it can be shown that for stretched or
skewed meshes the discretization errors of the cell-centered
formulation are larger than those of the node-based
schemes [4]. However, for smooth meshes the spatial accuracy
is the same for all schemes. On the other hand, numerical
experience has shown that for high speed flows the
cell-centered and the node-centered arrangement seem to
be more suited, especially in combination with an
upwind-biased discretization operator [5].

This paper focuses on the cell-vertex and node-centered
formulation. In both cases the spatial discretization leads to
an ordinary differential equation for the rate of change of
the conservative flow variables in each grid point

$$\frac{d}{dt}\vec W_{i,j,k} = -\frac{1}{V_{i,j,k}}\left[\vec R^c_{i,j,k} - \vec R^v_{i,j,k}\right], \qquad (3.1)$$

where $\vec R^c_{i,j,k}$ and $\vec R^v_{i,j,k}$ represent the approximation of
the inviscid and viscous net flux of mass, momentum and
energy for a particular control volume arrangement with
volume $V_{i,j,k}$ surrounding the grid node (i,j,k). The fluxes
can be approximated using either central or upwind
discretization operators. While classical central difference
schemes perform admirably for inviscid sub-, trans- and
even low supersonic flows, problems arise near strong
discontinuities in high Mach number flows. Moreover, in
recent papers (e.g. [6,7]) it has been pointed out that central
schemes show deficiencies in the resolution of viscous
flows due to the unsuitable scaling of the scalar artificial
dissipation implemented in most of the standard methods.
This lack of a suitable high-resolution capability has been
considered a major problem of central schemes and has
led to the development of a variety of upwind-biased
algorithms. These schemes rely on local wave propagation
theory for the differencing of the convective terms of the
governing equations. This is not only important for capturing
flow discontinuities, but it can also lead to a high-resolution
scheme for viscous flows, provided the linear waves are
properly taken into account. In the following, various
discretization schemes for the convective terms are discussed.
The special merits and shortcomings of each scheme are
highlighted. In discretizing the Navier-Stokes equations,
virtually all schemes rely on a centered approximation of
the viscous fluxes. A brief description is given at the end of
this chapter.


3.2  Central  Differencing  with  Scalar  Dissipation 


The central differencing of the convective terms of the
governing equations discussed here is based on the
cell-vertex scheme [4,8]. In this formulation the update of the
flow variables in grid node (i,j,k) is a function of the
discrete flux balances of the surrounding eight cells (see
Fig. 3). The term $\vec R^c_{i,j,k}$ in eq. (3.1) can be expressed as

$$\begin{aligned}
\vec R^c_{i,j,k} ={}& G_{i,j,k} + G_{i-1,j,k} + G_{i,j-1,k} + G_{i-1,j-1,k}\\
&+ G_{i,j,k-1} + G_{i-1,j,k-1} + G_{i,j-1,k-1} + G_{i-1,j-1,k-1}
\end{aligned} \qquad (3.2)$$

with $G_{i,j,k}$ representing the convective flux for the mesh
cell with vertices {(i+n,j,k), (i+n,j,k+1), (i+n,j+1,k),
(i+n,j+1,k+1), n=0,1}. Accordingly, the volume $V_{i,j,k}$ in
eq. (3.1) represents the sum of the volumes of the
corresponding cells surrounding the node (i,j,k).
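
The assembly of a nodal quantity from the eight cells sharing a node, as in eq. (3.2), can be sketched in a few lines of Python. The sketch assumes that cell (i,j,k) has the vertices (i+n, j+m, k+l) with n, m, l equal to 0 or 1, so that node (i,j,k) belongs to the cells (i-n, j-m, k-l). Cell data are stored in a dictionary and treated as scalars for brevity; the function name and the data layout are hypothetical.

```python
from itertools import product


def sum_over_surrounding_cells(cell_data, i, j, k):
    """Sum a cell quantity over the eight cells sharing node (i, j, k).

    Applied to the cell flux balances G this gives the nodal flux balance
    of eq. (3.2); applied to the cell volumes it gives the nodal volume
    that appears in eq. (3.1)."""
    return sum(cell_data[(i - di, j - dj, k - dk)]
               for di, dj, dk in product((0, 1), repeat=3))
```

A usage example: with `G = {(ci, cj, ck): 1.0 for ci in (0, 1) for cj in (0, 1) for ck in (0, 1)}`, the call `sum_over_surrounding_cells(G, 1, 1, 1)` collects the contributions of all eight cells around the interior node (1,1,1).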

The net flux $G_{i,j,k}$ is given by the sum of the inviscid
fluxes through all cell faces of the particular mesh cell (see
Fig. 3b)

$$G_{i,j,k} = g^i_{i+1,j,k} - g^i_{i,j,k} + g^j_{i,j+1,k} - g^j_{i,j,k} + g^k_{i,j,k+1} - g^k_{i,j,k} \qquad (3.3)$$

where the flux through a cell face, e.g. $g^i_{i+1,j,k}$, is evaluated
using an arithmetic average of the flux quantities at the
vertices. That is

$$g^i_{i+1,j,k} = \frac{1}{4}\,\vec S^i_{i+1,j,k}\cdot\left[\bar{\bar F}^c_{i+1,j,k} + \bar{\bar F}^c_{i+1,j+1,k} + \bar{\bar F}^c_{i+1,j,k+1} + \bar{\bar F}^c_{i+1,j+1,k+1}\right] \qquad (3.4)$$

where $\vec S^i_{i+1,j,k}$ denotes the surface vector of the cell face,
calculated by projecting the cell face on the corresponding
coordinate surface.

A close inspection of eqs. (3.2) and (3.3) shows that, due
to the fact that the fluxes across inner faces cancel, $\vec R^c_{i,j,k}$
represents the flux balance over a super cell formed by the
eight neighboring cells of node (i,j,k). According to [4,8],
the scheme is at least first-order accurate if the normal vector
on each cell face is a smooth function with respect to
grid refinement and if the cell faces do not degenerate to
triangles. On smooth meshes the discretization is
second-order accurate.

The finite volume discretization based on central averaging
is not dissipative, which means that high frequency oscillations
in the solution are not damped. In order to avoid these
spurious oscillations, dissipative terms have to be explicitly
introduced. In most central schemes the well known
scalar dissipation model of Jameson et al. [9] is implemented.
It uses a blend of second and fourth differences of
the flow variables. In order to preserve the conservation
form of the numerical scheme, the artificial dissipative
terms are introduced by adding dissipative fluxes to the
semi-discrete system (3.1)

$$\frac{d}{dt}\vec W_{i,j,k} = -\frac{1}{V_{i,j,k}}\left[\vec R^c_{i,j,k} - \vec R^v_{i,j,k} - \vec D_{i,j,k}\right]. \qquad (3.5)$$


The dissipative operator $\vec D_{i,j,k}$ is defined as

$$\vec D_{i,j,k} = d_{i+\frac12,j,k} - d_{i-\frac12,j,k} + d_{i,j+\frac12,k} - d_{i,j-\frac12,k} + d_{i,j,k+\frac12} - d_{i,j,k-\frac12} \qquad (3.6)$$


where the dissipative flux $d_{i+\frac12,j,k}$ is given as

$$d_{i+\frac12,j,k} = \alpha_{i+\frac12,j,k}\left\{\varepsilon^{(2)}_{i+\frac12,j,k}\left(\vec W_{i+1,j,k} - \vec W_{i,j,k}\right) - \varepsilon^{(4)}_{i+\frac12,j,k}\left(\vec W_{i+2,j,k} - 3\vec W_{i+1,j,k} + 3\vec W_{i,j,k} - \vec W_{i-1,j,k}\right)\right\}. \qquad (3.7)$$

Here, $\varepsilon^{(2)}$ and $\varepsilon^{(4)}$ are adaptive coefficients
designed to switch on enough dissipation where it is
needed. The coefficient $\alpha_{i+\frac12,j,k}$ is chosen such that the
dissipative terms have a proper weighting. According to
[9], the value is given as an average of the spectral radii of
the flux Jacobians associated with the three curvilinear
coordinates.

The coefficients $\varepsilon^{(2)}$ and $\varepsilon^{(4)}$ are adapted to the local flow
gradients by

$$\varepsilon^{(2)}_{i+\frac12,j,k} = k^{(2)}\max\left(\nu_{i+2,j,k},\ \nu_{i+1,j,k},\ \nu_{i,j,k},\ \nu_{i-1,j,k}\right) \qquad (3.8)$$

$$\varepsilon^{(4)}_{i+\frac12,j,k} = \max\left(0,\ k^{(4)} - \varepsilon^{(2)}_{i+\frac12,j,k}\right) \qquad (3.9)$$

where $\nu$ is defined as

$$\nu_{i,j,k} = \frac{\left|p_{i+1,j,k} - 2p_{i,j,k} + p_{i-1,j,k}\right|}{\left|p_{i+1,j,k} + 2p_{i,j,k} + p_{i-1,j,k}\right|} \qquad (3.10)$$

and $k^{(2)}$ and $k^{(4)}$ are constants. Typical values for $k^{(2)}$
and $k^{(4)}$ are 1/2 and 1/64, respectively. The dissipation
operators in j- and k-direction are defined in a similar manner.

The coefficient $\varepsilon^{(2)}$ is proportional to the second difference
of pressure and therefore proportional to the square of the
mesh size in smooth regions of the flow, while $\varepsilon^{(4)}$ is of
order one. Since the operator in eq. (3.7) contains differences
of the flow variables which are not divided by the mesh
size, the dissipative flux $d_{i+\frac12,j,k}$ is of third order. However,
in regions where the pressure changes rapidly, as in
the case of shock waves, the term $\nu_{i,j,k}$ is of order one and
with eq. (3.9) the third order difference operator in eq. (3.7)
is switched off. The dissipation is then of first order and the
central finite volume scheme behaves like a first-order
accurate scheme. The sensitivity of the numerical solution
with respect to the dissipation parameters has been studied
in detail in [10,11]. Since the dissipative fluxes are formed
by blended second and fourth differences, the evaluation
of these terms near boundaries requires special care. The
treatment at boundaries is described in more detail in [12].

As mentioned above, the dissipation in each coordinate
direction is scaled the same by the average of the spectral
radii of all flux Jacobians. This leads to excessively large
dissipation levels for cells with high aspect ratios, which are
often required for accurate and efficient calculations of
viscous flows. Therefore, according to Martinelli [13] the
scaling factor of the dissipative term is adjusted for each
coordinate direction taking into account a varying cell
aspect ratio (see also [14]). The scaling function is

$$\alpha_{i+\frac12,j,k} = \lambda^i\,\phi^i, \qquad \phi^i = 1 + \max\left[\left(\frac{\lambda^j}{\lambda^i}\right)^{\beta},\ \left(\frac{\lambda^k}{\lambda^i}\right)^{\beta}\right],$$

where

$$\lambda^i = |\vec q\cdot\vec S^i| + c\,|\vec S^i|, \qquad \lambda^j = |\vec q\cdot\vec S^j| + c\,|\vec S^j|, \qquad \lambda^k = |\vec q\cdot\vec S^k| + c\,|\vec S^k|$$

are the spectral radii of the flux Jacobians in i-, j-, k-direction,
respectively, $\vec q = [u,v,w]^T$ is the vector of cartesian
velocities and $c$ is the speed of sound. $\vec S^i$, $\vec S^j$, $\vec S^k$ are the cell
face vectors associated with the i-, j-, and k-direction of the
body-fitted coordinate system. The use of the maximum
function in the definition of $\phi$ is important for grids where
$\lambda^j/\lambda^i$ and $\lambda^k/\lambda^i$ are very large and of the same order of
magnitude. In this case, if these ratios are summed rather than
taking the maximum, too large dissipative terms are obtained,
which will degrade the solution. It has been found
that for the exponent $\beta$ the choice $\beta = 0.5$ yields a robust
scheme.
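
The scalar dissipation model can be sketched for a single index direction and a scalar flow variable as follows. The sketch assumes the standard Jameson-type form of the coefficients, that is, a pressure sensor normalized by the sum p(i+1) + 2p(i) + p(i-1), a second-difference coefficient taken as k2 times the maximum sensor value over the four neighbouring nodes, and a fourth-difference coefficient max(0, k4 - eps2). The scaling factor `alpha` is passed in directly, and all function and parameter names are hypothetical.

```python
def pressure_sensor(p, i):
    """Normalized second difference of the pressure at node i (cf. eq. (3.10))."""
    return (abs(p[i + 1] - 2.0 * p[i] + p[i - 1])
            / (p[i + 1] + 2.0 * p[i] + p[i - 1]))


def dissipative_flux(w, p, alpha, i, k2=0.5, k4=1.0 / 64.0):
    """Dissipative flux d_{i+1/2} as a blend of second and fourth differences.

    w and p are sequences of a flow variable and the pressure along one
    grid line; the nodes i-2 ... i+3 must exist."""
    nu = [pressure_sensor(p, m) for m in (i - 1, i, i + 1, i + 2)]
    eps2 = k2 * max(nu)             # switched on near shocks
    eps4 = max(0.0, k4 - eps2)      # switched off near shocks
    second = w[i + 1] - w[i]
    fourth = w[i + 2] - 3.0 * w[i + 1] + 3.0 * w[i] - w[i - 1]
    return alpha * (eps2 * second - eps4 * fourth)
```

In a smooth region with constant pressure and a linearly varying `w` both contributions vanish, whereas across a pressure jump only the first-order second-difference term remains active.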

The transonic turbulent flow over the RAE 2822 airfoil is
used to demonstrate the capabilities of the method for high
Reynolds number turbulent flows. The well-known test
case $M_\infty = 0.73$, $\alpha = 2.79°$ and $Re_\infty = 6.5\times10^6$ has been
considered. The accuracy of the central scheme with scalar
dissipation is examined using a variation of the grid density.
For this purpose a sequence of a coarse (193x33
points), medium (385x65 points) and fine grid (577x97
points) has been created. A C-grid topology has been selected
with a first spacing of $10^{-5}$ chord lengths away from
the wall. The calculations have been carried out with the
Baldwin/Lomax turbulence model. The variation of the
coefficients for lift, pressure drag and friction drag with
the number of mesh points N is presented in Fig. 4. The influence
of the dissipation parameters of the second and fourth
difference dissipation operators is also indicated. On coarse
meshes, the discretization error is obviously dominated by
the artificial dissipation. The integral values show a large
variation. The fine meshes allow the extrapolation of the
coefficients to their values for an infinitely fine mesh. For
the mesh with 385x65 points, the predicted lift is within 1.5
percent, the pressure drag is within 3 counts and the friction
drag is within 0.3 count of the extrapolated values. For
the fine mesh with 577x97 points, the predicted lift is
within 0.5 percent, the pressure drag within 1 count and the
skin friction drag within 0.1 count. Fig. 5 shows pressure
and skin friction distributions for different grid densities.
The experimental values [15] are also included. The main
features of the flow are essentially captured on the medium
mesh. The differences between medium and fine mesh are
small.
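
The extrapolation of an integral coefficient to an infinitely fine mesh can be sketched as follows. The sketch assumes that the coefficient varies linearly with 1/N, where N is the total number of mesh points, which corresponds to a second-order error on a two-dimensional grid. This error model, the function name and the sample numbers in the usage note are assumptions for illustration and are not results from the lecture.

```python
def extrapolate_to_infinite_mesh(n_a, c_a, n_b, c_b):
    """Extrapolate a coefficient c(N) = c_inf + a / N to N -> infinity.

    n_a, n_b are the total numbers of mesh points of two grids and
    c_a, c_b the coefficient values computed on them."""
    return (n_b * c_b - n_a * c_a) / (n_b - n_a)


def mesh_points(nx, ny):
    """Total number of mesh points of a structured 2D grid."""
    return nx * ny
```

With purely hypothetical lift values of 0.792 on the 385x65 mesh and 0.797 on the 577x97 mesh, the call `extrapolate_to_infinite_mesh(mesh_points(385, 65), 0.792, mesh_points(577, 97), 0.797)` would return the estimated mesh-independent value under the stated error model.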

In Fig. 6 the transonic flow around the ONERA-M6 wing
is considered. The computational domain is discretized using
a C-type topology in the streamwise direction and an
O-type topology in the spanwise direction with 289x65x45
points. A somewhat coarser mesh has also been used in order
to indicate the influence of the grid density for
three-dimensional viscous flows. The commonly used test case
$M_\infty = 0.84$, $\alpha = 3.06°$ and $Re_\infty = 11\times10^6$ has been considered.
Here, again the Baldwin/Lomax turbulence model has been
used. The pressure distributions along several spanwise
stations of the wing are displayed in Fig. 6. The results of
the fine mesh agree well with those from the coarser mesh
and with experimental data [16].

A  comprehensive  validation  of  the  central  cell-vertex 
scheme  with  scalar  dissipation  can  be  found  in  [4,  14,  17, 
18]. 

3.3  Central  Differencing  with  Matrix  Dissipation 

As shown in the literature and indicated by the results
above, it is possible to obtain grid-converged solutions for
transonic viscous flows with central schemes, provided
sufficiently fine meshes are used for the computations.
However, for efficiency reasons, especially for 3-D applications,
the accuracy of the solution needs to be improved
on a given grid, in order to reduce the number of grid
points required for obtaining a specified level of accuracy.
The major drawback of standard central schemes, such as the
one presented above, is the scalar form of the artificial
viscosity. In this approach the dissipation of each conservation
equation is scaled the same. The spectral radius of the
flux Jacobian associated with the corresponding coordinate
direction is employed as the scaling factor. As suggested
by Turkel [19] and Swanson and Turkel [20], the central
discretization can be improved by replacing the scalar
dissipation by a matrix-valued dissipation using ideas from
the concept of upwind schemes. In this case, the dissipation
in a particular coordinate direction for each equation is
scaled by the specific eigenvalue associated with the
corresponding flux Jacobian matrix.

In the case of matrix-valued dissipation the dissipative flux
$d_{i+\frac12,j,k}$ of eq. (3.6) through interface i+1/2 is defined as

$$d_{i+\frac12,j,k} = |A|_{i+\frac12,j,k}\left\{\varepsilon^{(2)}_{i+\frac12,j,k}\left(\vec W_{i+1,j,k} - \vec W_{i,j,k}\right) - \varepsilon^{(4)}_{i+\frac12,j,k}\left(\vec W_{i+2,j,k} - 3\vec W_{i+1,j,k} + 3\vec W_{i,j,k} - \vec W_{i-1,j,k}\right)\right\}. \qquad (3.14)$$

In contrast to eq. (3.7), the differences of the flow quantities
are now scaled by a matrix which is given by

$$|A|_{i+\frac12,j,k} = \left(T\,|\Lambda|\,T^{-1}\right)_{i+\frac12,j,k} \qquad (3.15)$$

with $T$ and $T^{-1}$ being the right and left eigenvector matrices
of the flux Jacobian $A$ associated with the i-direction of
the curvilinear coordinate system. $|\Lambda|$ denotes a diagonal
matrix, where the elements are the absolute values of the
eigenvalues of $A$. The eigenvalues of $A$ are given by

$$\begin{aligned}
\lambda_n &= \vec q\cdot\vec S^i, \quad n = 1,2,3\\
\lambda_4 &= \vec q\cdot\vec S^i + c\,|\vec S^i|\\
\lambda_5 &= \vec q\cdot\vec S^i - c\,|\vec S^i| .
\end{aligned} \qquad (3.16)$$

The matrices in eq. (3.15) are evaluated at the interface
i+1/2 using simple averages of the flow quantities $\vec W$ at
grid nodes (i,j,k) and (i+1,j,k). According to [19,20], by
taking advantage of the special form of the elements of
$|A|$, the matrix vector products occurring in eq. (3.14) can
be replaced by the products of row and column vectors.
This leads to a simpler and more efficient procedure for the
evaluation of the dissipative flux. For details see also [21].
The parameters $\varepsilon^{(2)}$ and $\varepsilon^{(4)}$ are essentially the same as in
the case of the scalar dissipation. Also here, typical values
of the coefficients $k^{(2)}$ and $k^{(4)}$ are 1/2 and 1/64,
respectively. Note that, if $|A|$ is replaced by its spectral radius,
then the usual scalar dissipation outlined above is obtained.
As can be seen in eqs. (3.14) and (3.15), for each flow
equation the dissipation is scaled by the corresponding
eigenvalue. In practice, however, the eigenvalues as given
in eq. (3.16) cannot be used. At stagnation points the
eigenvalues $\lambda_1$, $\lambda_2$ and $\lambda_3$ vanish, whereas near sonic lines
(M=1) the eigenvalue $\lambda_4$ or $\lambda_5$ approaches zero. It is well
known that for a central difference scheme zero artificial
viscosity can create numerical difficulties. Therefore, the
eigenvalues are limited in the following manner according
to [19,20]

$$\begin{aligned}
|\tilde\lambda_n| &= \max\left(|\lambda_n|,\ V_l\,\rho(A)\right), \quad n = 1,2,3\\
|\tilde\lambda_4| &= \max\left(|\lambda_4|,\ V_n\,\rho(A)\right)\\
|\tilde\lambda_5| &= \max\left(|\lambda_5|,\ V_n\,\rho(A)\right)
\end{aligned} \qquad (3.17)$$

where $V_l$ and $V_n$ are small coefficients which limit the
eigenvalues associated with the linear and nonlinear
characteristic fields to a minimum value that is a fraction of the
spectral radius $\rho(A)$ (largest eigenvalue) of $A$. The parameters
$V_l$ and $V_n$ are determined through numerical experiments
such that shocks are captured without spurious oscillations
and good convergence behavior is still maintained.
Typical values are $0.3 < V_l < 0.6$ and $V_n = 0.4$ (see [21]). It
should be noted that in the case of $V_l = V_n = 1$ the scalar
form of the artificial dissipation is recovered.
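
The eigenvalue limiting can be illustrated with a short Python sketch that evaluates the five eigenvalues of the flux Jacobian from the velocity vector, the cell face vector and the speed of sound, and then applies the lower bounds expressed as fractions of the spectral radius. The function name, the argument layout and the default values `v_l = 0.4` and `v_n = 0.4` are illustrative assumptions within the range quoted in the text.

```python
import math


def limited_eigenvalues(q, S, c, v_l=0.4, v_n=0.4):
    """Return the five limited eigenvalue magnitudes for one face.

    q: cartesian velocity vector, S: cell face vector, c: speed of sound.
    The first three entries belong to the linear fields, the last two
    to the nonlinear (acoustic) fields."""
    qs = sum(a * b for a, b in zip(q, S))        # q . S
    s_norm = math.sqrt(sum(a * a for a in S))    # |S|
    lam = [qs, qs, qs, qs + c * s_norm, qs - c * s_norm]
    rho = abs(qs) + c * s_norm                   # spectral radius of A
    limited = [max(abs(l), v_l * rho) for l in lam[:3]]
    limited += [max(abs(l), v_n * rho) for l in lam[3:]]
    return limited
```

At a stagnation point, for instance `limited_eigenvalues((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 1.0)`, the first three entries are lifted from zero to `v_l` times the spectral radius, while setting both coefficients to one returns the spectral radius for all five entries, which corresponds to the scalar dissipation.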

The improvement of the accuracy of the central scheme for
a given grid by using a matrix-valued dissipation instead of
a scalar dissipation is demonstrated for the turbulent flow
around the RAE 2822 airfoil [21]. Fig. 7 shows the comparison
of the surface pressure distribution for scalar and
matrix dissipation. The calculations have been carried out
on C-grids with 160x32, 320x64 and 640x128 cells. It is
obvious that the quality of the solution obtained with the
scalar artificial viscosity model on the 320x64 and
640x128 grids can already be achieved with the matrix
dissipation on the next coarser grid, that is on the 160x32 and
320x64 grid, respectively. This is underlined in Fig. 8,
where the skin friction distributions are compared. Fig. 9
shows the variation of the global force coefficients with the
number of mesh points N=NX*NY. In contrast to the scalar
dissipation model, the matrix dissipation provides a
second-order scheme, which is indicated by the linear dependency
of the integral values with respect to the total number
of grid points. This figure also shows that the results
calculated with the scalar dissipation model on the
640x128 grid are already obtained on the 320x64 grid by using
the matrix dissipation approach. On the other hand, on
a given grid the matrix dissipation model requires additional
computational costs due to the increased complexity.
Furthermore, it shows a degradation of the convergence
behavior to steady state [20,21]. However, for a specified
level of accuracy the central scheme with matrix dissipation
is more cost-effective than with the scalar dissipation,
since coarser grids can be used. Thus, e.g. for the
two-dimensional turbulent flow past an airfoil the computational
effort could be reduced by a factor of 2-3.


3.4  Flux  Difference  Splitting 

Numerical analysis of high speed flow often involves the
resolution of strong shocks producing pressure jumps of
considerable strength, complex shock-shock interactions,
expansion fans and contact discontinuities, as well as regions
of highly expanded flow, e.g. on the leeside of re-entry
vehicles at high angle of attack. For such flows, certain
aspects of the numerical methods which perform well
for sub- and transonic applications have to be modified in
order to facilitate robust, efficient and accurate calculations.
Classical central difference schemes are not well
suited to such flows, since they require excessive artificial
damping in order to suppress high frequency oscillations
which may grow unbounded in the vicinity of strong
shocks. This has led to the development of a variety of
upwind schemes. These schemes rely on local wave propagation
theory for the differencing of the convective terms
of the governing equations throughout the domain. Prominent
representatives of this class of algorithms are schemes
based on the 'Flux Difference Splitting' (e.g. [22,23]) and
the 'Flux Vector Splitting' (e.g. [24,25]) concepts.
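The structure shared by upwind fluxes, a central average plus a dissipative term scaled by the local wave speed, can be illustrated on the scalar model problem u_t + a u_x = 0. The sketch below is not the scheme used in this work; the function name and the sample values are assumptions chosen for illustration only.

```python
def upwind_flux(u_left, u_right, a):
    """First-order upwind flux for u_t + a*u_x = 0 at a cell face.

    Written as central average minus a dissipative term; the result
    equals a*u_left for a > 0 and a*u_right for a < 0.
    """
    central = 0.5 * a * (u_left + u_right)
    dissipation = 0.5 * abs(a) * (u_right - u_left)
    return central - dissipation


if __name__ == "__main__":
    # Wave travelling to the right: the left (upstream) state is used.
    print(upwind_flux(1.0, 0.0, 2.0))
    # Wave travelling to the left: the right (upstream) state is used.
    print(upwind_flux(1.0, 0.0, -2.0))
```

Dropping the dissipative term leaves the pure central flux, which has no mechanism to damp the high frequency oscillations mentioned above.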

Out of the class of flux difference split methods we have
focused on the upwind TVD discretization according to
[23,26]. This scheme is based on the approximate Riemann
solver of Roe [22] and uses the modified flux approach of
Harten [23] for second-order accuracy. Upwinding in
multi-dimensions is performed by applying the one-dimensional
operator successively in each coordinate direction.
In order to implement an approximate Riemann solver in
the framework of a node-based finite volume scheme
[5,27], control volumes are used which are defined by connecting
the cell centers of the original cells (see Fig. 10).
The convective flux $R^c_{i,j,k}$ for the control volume $V_{i,j,k}$ in
eq. (3.1) is then approximated by

$$
R^c_{i,j,k} = R^c_{i+\frac12,j,k} - R^c_{i-\frac12,j,k}
            + R^c_{i,j+\frac12,k} - R^c_{i,j-\frac12,k}
            + R^c_{i,j,k+\frac12} - R^c_{i,j,k-\frac12}
\tag{3.18}
$$

The flux $R^c_{i+\frac12,j,k}$ through cell face $i+1/2$ is given as

$$
R^c_{i+\frac12,j,k} = \frac12 \left[ \left( F_{i+1,j,k} + F_{i,j,k} \right) \cdot S_{i+\frac12,j,k}
  + T_{i+\frac12,j,k}\, Q_{i+\frac12,j,k} \right]
\tag{3.19}
$$

with $T$ denoting the right eigenvector matrix of the flux Jacobian
in the i-direction of the curvilinear coordinate system.
Eq. (3.19) separates the inviscid numerical flux into
the sum of an averaged term corresponding to central differencing
and a dissipative term, which adapts the discretization
stencil in accordance with local wave propagation. According
to Yee and Harten [23], the n'th component $q^n$ of
the flux function $Q$ is given as

$$
q^n_{i+\frac12,j,k} = \frac12\, \psi\!\left( \lambda^n_{i+\frac12,j,k} \right)
  \left( h^n_{i+1,j,k} + h^n_{i,j,k} \right)
  - \psi\!\left( \lambda^n_{i+\frac12,j,k} + \gamma^n_{i+\frac12,j,k} \right)
  \alpha^n_{i+\frac12,j,k}
\tag{3.20}
$$
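where $\lambda^n_{i+\frac12,j,k}$ represents the n'th eigenvalue of the transformed
flux Jacobian in i-direction, and $\alpha^n_{i+\frac12,j,k}$ denotes the n'th component
of the difference of characteristic variables

$$
\alpha_{i+\frac12,j,k} = \left( T^{-1} \right)_{i+\frac12,j,k} \left( W_{i+1,j,k} - W_{i,j,k} \right)
\tag{3.21}
$$

and

$$
\gamma^n_{i+\frac12,j,k} =
\begin{cases}
\dfrac12\, \psi\!\left( \lambda^n_{i+\frac12,j,k} \right)
\dfrac{h^n_{i+1,j,k} - h^n_{i,j,k}}{\alpha^n_{i+\frac12,j,k}}
  & \text{if } \alpha^n_{i+\frac12,j,k} \neq 0 \\[2ex]
0 & \text{if } \alpha^n_{i+\frac12,j,k} = 0.
\end{cases}
\tag{3.22}
$$

The function $\psi$, often called the entropy function, prevents the
scheme from violating the entropy condition when the
wave speeds $\lambda^n$ vanish. According to Harten, this function
is given by

$$
\psi(\lambda^n) =
\begin{cases}
|\lambda^n| & \text{if } |\lambda^n| \ge \delta \\[1ex]
\dfrac{|\lambda^n|^2 + \delta^2}{2\delta} & \text{if } |\lambda^n| < \delta
\end{cases}
\tag{3.23}
$$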

where $0 < \delta < 0.5$ is a suitably chosen parameter. The term $h^n$
in eqs. (3.20) and (3.22) represents a limiter function which
brings the scheme to second order. Many limiter functions
have been proposed in the literature (see e.g. [26,28]). In
most of our calculations the function

$$
h^n_{i,j,k} = \ldots
\tag{3.24}
$$

is used, where $\varepsilon > 0$ is a small constant to prevent division
by zero. The quantities at face $i+1/2$ are evaluated using the
Roe-averaged state [22] involving the values at grid nodes
$(i,j,k)$ and $(i+1,j,k)$. The fluxes through the other cell faces
are evaluated in a similar manner.

Setting the limiter $h^n$ identically to zero reduces this
method to Roe's first-order flux difference method. It has
been shown that the scheme is TVD (Total Variation Diminishing)
for one-dimensional nonlinear hyperbolic scalar
equations and for linear constant coefficient systems. It
is formally second-order accurate except at shocks, where
the limiter reduces the accuracy to first order.

For viscous flows the entropy correction, eq. (3.23), has to
be carefully designed. The shear layers along solid walls
are numerically smeared if an entropy correction is applied
to the eigenvalues associated with the convective
waves. On the other hand, if cells with high aspect ratios
are present, additional support for damping in the direction
of the long side of a cell is needed in regions of low velocities,
such as stagnation points. Therefore, as proposed by
Radespiel and Swanson [29], the correction is constructed
as a function of the cell aspect ratio. In i-direction the correction
for the linear waves, n=1,2,3 (see eq. (3.16)), is defined as

$$
\psi(\lambda^n) =
\begin{cases}
|\lambda^n| & \text{if } |\lambda^n| \ge \delta \\[1ex]
\beta\, \dfrac{|\lambda^n|^2 + \delta^2}{2\delta} + (1-\beta)\, |\lambda^n| & \text{if } |\lambda^n| < \delta
\end{cases}
\tag{3.25}
$$

and for the acoustic waves, n=4,5, it is given by

$$
\psi(\lambda^n) =
\begin{cases}
|\lambda^n| & \text{if } |\lambda^n| \ge \delta \\[1ex]
\dfrac{|\lambda^n|^2 + \delta^2}{2\delta} & \text{if } |\lambda^n| < \delta.
\end{cases}
\tag{3.26}
$$

The parameter $\delta$ is given according to Muller [30]

$$
\delta = \bar{\delta}\, \lambda^i \left[ 1 + \max\!\left(
  \left( \frac{\lambda^j}{\lambda^i} \right)^{\omega},
  \left( \frac{\lambda^k}{\lambda^i} \right)^{\omega} \right) \right]
\tag{3.27}
$$

where $\lambda^i$, $\lambda^j$, $\lambda^k$ are the spectral radii of the flux Jacobians
in i-, j-, k-direction, respectively, and $0 < \omega < 1$. The blending
coefficient, $\beta$, accounts for the cell aspect ratio. It is given by

$$
\beta = \ldots
\tag{3.28}
$$

It has been shown in [29] that a wide range of flow problems
can be solved accurately with a single set of parameters,
that is $\bar{\delta} = 0.25$ and $\omega = 0.3$.
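A minimal sketch of the entropy function, assuming the standard Harten form for eq. (3.23) (the absolute value away from zero, a parabola of height delta/2 at zero) and the aspect-ratio blending described above for the linear waves. The blending coefficient `beta` is passed in by the caller, and the function names are illustrative assumptions, not taken from the original code.

```python
def harten_entropy_fix(lam, delta):
    """Entropy function: |lam| for |lam| >= delta, a parabola below delta."""
    a = abs(lam)
    if a >= delta:
        return a
    return (a * a + delta * delta) / (2.0 * delta)


def blended_entropy_fix(lam, delta, beta):
    """Blend between full correction (beta = 1) and no correction (beta = 0).

    Intended for the linear (convective) waves, where a full correction
    would smear shear layers along walls.
    """
    a = abs(lam)
    if a >= delta:
        return a
    return beta * harten_entropy_fix(lam, delta) + (1.0 - beta) * a


if __name__ == "__main__":
    # At a vanishing wave speed the full correction keeps a finite value,
    # while beta = 0 returns the uncorrected |lam| = 0.
    print(harten_entropy_fix(0.0, 0.25))
    print(blended_entropy_fix(0.0, 0.25, 0.0))
```

Both branches of each function meet at |lam| = delta, so the corrected eigenvalue is continuous in the wave speed.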

In the following some results obtained with the TVD
scheme are shown in order to demonstrate the capability of
the method. Firstly, in Figs. 11-12 the accuracy is displayed
for the turbulent transonic flow around the RAE 2822 airfoil.
Calculations have been carried out on a coarse (80x16
cells), medium (160x32 cells) and fine grid (320x64 cells) with
C-grid topology. Fig. 11 shows the pressure and skin friction
distributions along the surface for the
three different grids. It is seen that with the TVD scheme a
grid-converged solution is obtained: the difference between
the medium and fine grid solutions is very small. The
improvement in the force coefficients obtained with the upwind TVD
scheme compared to the classical central scheme of chapter
3.2 is shown in Fig. 12, where lift and drag values are
plotted as a function of the inverse of the total number of
cells.
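Plotting against the inverse cell count makes the grid-converged limit readable as the intercept at 1/N -> 0. A minimal sketch of that extrapolation follows, assuming the force coefficient varies linearly in 1/N between the two finest grids (in two dimensions 1/N scales with the square of the mesh spacing, so this corresponds to second-order convergence). The lift values in the usage example are invented for illustration and are not results from this work.

```python
def extrapolate_to_infinite_cells(n_medium, f_medium, n_fine, f_fine):
    """Linear extrapolation of a quantity f in h = 1/N to h -> 0."""
    h_medium = 1.0 / n_medium
    h_fine = 1.0 / n_fine
    slope = (f_medium - f_fine) / (h_medium - h_fine)
    return f_fine - slope * h_fine


if __name__ == "__main__":
    n_medium = 160 * 32   # medium grid of the RAE 2822 study
    n_fine = 320 * 64     # fine grid of the RAE 2822 study
    # Hypothetical lift coefficients, for illustration only.
    cl_medium, cl_fine = 0.80, 0.83
    print(extrapolate_to_infinite_cells(n_medium, cl_medium, n_fine, cl_fine))
```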

Next the laminar flow over the NACA 0012 airfoil at
$M_\infty = 25$ and $\alpha = 25°$ is chosen as a test case to demonstrate
that the method is able to handle very strong shock waves
and highly expanded flow. Fig. 13 shows the 250x80 mesh.
The numerical solution is represented in Figs. 14-17. The
streamlines in Fig. 15 feature a large separated flow region
with two distinct vortices. The difficulties in resolving this
highly separated flow are illustrated by a comparison of the
distributions of skin friction and Stanton number along the
airfoil obtained on meshes of different fineness. It is
obvious that the grid with 129x41 mesh points is still too
coarse to resolve the separated flow region.

The third viscous test case presented here is the hypersonic
laminar flow past a 15° compression ramp. With onflow
conditions $M_\infty = 11.68$, $Re = 2.47 \times 10^5$, $T_\infty = 65$ K and
$T_w/T_\infty = 4.604$ it corresponds to case III.4 of the Workshop
on Hypersonic Flows for Reentry Problems, Part II, held in
Antibes, France, 1991 [31]. Results have been obtained for
three successive grids [32]. The Mach contours of the fine
grid with 288x224 cells are shown in Fig. 18. The pressure
coefficient, skin friction and Stanton number are displayed
in Figs. 19-21. It is seen that almost identical solutions are
obtained on the medium and fine meshes. Experimental
data of Holden [33] are also plotted. The comparison of
experimental and numerical results shows that the calculated
separation extent is somewhat larger than the experimental
result. The discrepancies may be attributed to the fact that
the experimental data contain 3D effects which are not
modeled in the computation.

As a last test case, Edney's Type IV shock-interference
flow [34] is investigated. This flow problem requires the
solver to resolve many demanding flow features (see Fig. 22)
and it brings out significant differences in the accuracy and
convergence behavior of the numerical methods [35].
Fig. 23 shows the Mach contours for the TVD scheme and
the classical central scheme with scalar dissipation. Inviscid
results have been obtained for a coarse grid with
60x40 cells and a fine grid with 120x80 cells, as shown in
Fig. 23. The results demonstrate the superior resolution of
the upwind TVD scheme. As anticipated, the additional
dissipation required for the central scheme to suppress
oscillations near shocks considerably smears both the
impinging shock and the distorted bow shock. The upwind
method sharply resolves these features. Moreover, even on
the coarse mesh the internal structure of the field is captured,
including the embedded shock and the terminating normal
shock. In contrast, the fine grid solution obtained
with the central difference scheme still shows a lack
of structure.

Many two- and three-dimensional applications [27,29,37]
have shown that the upwind TVD scheme provides an accurate
discretization for inviscid and viscous flows. Based
on our experience, however, flux difference split methods
are difficult to use with respect to robustness and parameter
sensitivity for hypersonic flow fields with strong expansions
into regions of low pressure and low density, as e.g.
on the leeside of re-entry vehicles at high angle of attack.
Moreover, the extension of flux difference split methods to
non-equilibrium flows is rather complex.

3.5  Flux  Vector  Splitting 

Upwind methods based on the flux vector splitting concept
have proven to be efficient and robust schemes for inviscid
flows. However, they often exaggerate diffusive effects
which take place in shear and boundary layers. Consequently,
substantial effort has been put into the improvement
of flux vector split methods for viscous flows [38,39,40].

A remarkably simple upwind flux vector splitting scheme
has been introduced by Liou and Steffen [38,40]. It treats
the convective and pressure terms of the flux function separately.
The convective quantities are extrapolated to the
cell interface in an upwind-biased manner using a properly
defined cell face advection Mach number. Accordingly, the
scheme is called Advection Upstream Splitting Method
(AUSM). Results for simple flow problems given by Liou
[39,41] have shown that AUSM retains the robustness and
efficiency of the flux vector splitting schemes while
achieving the high accuracy attributed to schemes based on
the flux difference splitting concept. The computational effort
for the flux evaluation is only linearly proportional to
the number of unknowns, as in the case of central differencing.
Furthermore, the scheme can be easily extended to
real gas calculations.

The application to various relevant
flow problems, however, has shown [36,42,43,44] that the
original flux vector splitting method of Liou and Steffen
has several deficiencies. It locally produces pressure oscillations
in the vicinity of shocks. Furthermore, the scheme
has a poor damping behavior for small Mach numbers,
which leads to spurious oscillations in the solution and affects
the ability of the scheme to capture flows aligned with
the grid coordinates.

In the present paper several modifications to the original
advection upstream splitting method of Liou and Steffen
are proposed which substantially improve the scheme's
ability to predict viscous flows accurately. In particular, a
hybrid method is introduced which switches from AUSM
to the van Leer scheme at shock waves. This ensures the
well-known sharp and clean shock capturing capability of the
van Leer scheme and the high resolution of slip lines and
contact discontinuities through AUSM. An adaptive dissipation
is introduced in order to achieve sufficient numerical
damping in cases of adverse grid situations and flow
alignment. Furthermore, the MUSCL implementation for
higher-order accuracy is modified to allow a more accurate
scaling of the numerical dissipation in boundary layers,
where the contravariant Mach number is usually small in
the wall-normal direction. The improved accuracy of the
modified scheme is demonstrated by the calculation of
two- and three-dimensional inviscid and viscous flows.

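As shown in [39,36], the discrete inviscid flux $R^c_{i+\frac12,j,k}$
through cell face $i+1/2$ (see eq. (3.18)) can be interpreted as
a sum of a Mach number weighted average of the left (L)
and right (R) state at the cell face $i+1/2$ (see Fig. 10) and
a scalar dissipative term. It reads

$$
R^c_{i+\frac12,j,k} = |S|_{i+\frac12,j,k}
\left\{ \frac12 M_{i+\frac12,j,k}
\left(
\begin{bmatrix} \rho c \\ \rho c u \\ \rho c v \\ \rho c w \\ \rho c H \end{bmatrix}_L
+
\begin{bmatrix} \rho c \\ \rho c u \\ \rho c v \\ \rho c w \\ \rho c H \end{bmatrix}_R
\right)
- \frac12 \Phi_{i+\frac12,j,k}
\left(
\begin{bmatrix} \rho c \\ \rho c u \\ \rho c v \\ \rho c w \\ \rho c H \end{bmatrix}_R
-
\begin{bmatrix} \rho c \\ \rho c u \\ \rho c v \\ \rho c w \\ \rho c H \end{bmatrix}_L
\right)
\right\}
+
\begin{bmatrix} 0 \\ s_x p \\ s_y p \\ s_z p \\ 0 \end{bmatrix}_{i+\frac12,j,k}
\tag{3.29}
$$

where

$$
S_{i+\frac12,j,k} = [\, s_x,\ s_y,\ s_z \,]^T
\tag{3.30}
$$

denotes the surface vector normal to the cell face $i+1/2$.
The quantity $c$ represents the speed of sound. $M_{i+\frac12,j,k}$
denotes the advection Mach number at the cell face $i+1/2$,
which is calculated according to [39] as

$$
M_{i+\frac12,j,k} = M_L^{+} + M_R^{-}
\tag{3.31}
$$

where the split Mach numbers $M^{\pm}$ are defined as [25]

$$
M^{+} =
\begin{cases}
M & \text{if } M \ge 1 \\
\frac14 (M+1)^2 & \text{if } |M| < 1 \\
0 & \text{if } M \le -1
\end{cases}
\qquad
M^{-} =
\begin{cases}
0 & \text{if } M \ge 1 \\
-\frac14 (M-1)^2 & \text{if } |M| < 1 \\
M & \text{if } M \le -1
\end{cases}
\tag{3.32}
$$

$M_L$ and $M_R$ denote the Mach number associated with the
left and right state, respectively. The advection Mach number
is given by

$$
M = \frac{s_x u + s_y v + s_z w}{|S|\, c}
\tag{3.33}
$$

The pressure $p$ at cell face $i+1/2$ is calculated in a similar
way as

$$
p_{i+\frac12,j,k} = p_L^{+} + p_R^{-}
\tag{3.34}
$$

where $p^{\pm}$ denote the split pressure defined according to
[25]

$$
p^{+} =
\begin{cases}
p & \text{if } M \ge 1 \\
\frac14\, p\, (M+1)^2 (2-M) & \text{if } |M| < 1 \\
0 & \text{if } M \le -1
\end{cases}
\qquad
p^{-} =
\begin{cases}
0 & \text{if } M \ge 1 \\
\frac14\, p\, (M-1)^2 (2+M) & \text{if } |M| < 1 \\
p & \text{if } M \le -1
\end{cases}
\tag{3.35}
$$

The definition of the dissipative term $\Phi$ determines the
particular flux vector splitting formulation. A hybrid
scheme is proposed here [45], which combines the van
Leer scheme and the scheme of Liou and Steffen (AUSM).
It reads

$$
\Phi_{i+\frac12,j,k} = (1-\omega)\, \Phi^{VL}_{i+\frac12,j,k} + \omega\, \Phi^{modAUSM}_{i+\frac12,j,k}
\tag{3.36}
$$

with

$$
\Phi^{VL}_{i+\frac12,j,k} =
\begin{cases}
|M_{i+\frac12,j,k}| & \text{if } |M_{i+\frac12,j,k}| \ge 1 \\
|M_{i+\frac12,j,k}| + \frac12 (M_R - 1)^2 & \text{if } 0 \le M_{i+\frac12,j,k} < 1 \\
|M_{i+\frac12,j,k}| + \frac12 (M_L + 1)^2 & \text{if } -1 < M_{i+\frac12,j,k} \le 0
\end{cases}
\tag{3.37}
$$

$$
\Phi^{modAUSM}_{i+\frac12,j,k} =
\begin{cases}
|M_{i+\frac12,j,k}| & \text{if } |M_{i+\frac12,j,k}| > \tilde{\delta} \\[1ex]
\dfrac{M_{i+\frac12,j,k}^2 + \tilde{\delta}^2}{2\tilde{\delta}} & \text{if } |M_{i+\frac12,j,k}| \le \tilde{\delta}
\end{cases}
\tag{3.38}
$$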


where $\tilde{\delta}$ is a small parameter, $0 < \tilde{\delta} < 0.5$, and $\omega$ is a
constant, $0 \le \omega \le 1$.

The above equations clearly show that for a supersonic cell
face Mach number the hybrid scheme represents a pure
upwind discretization, using either the left or right state for
the convective and pressure terms, depending on the sign
of the Mach number. For $\omega = 0$ the method reduces to the
classical van Leer flux vector splitting scheme. In the case
of $\omega = 1$ and $\tilde{\delta} = 0$ the original AUSM developed by Liou
and Steffen is recovered. Comparing both fluxes it is obvious
that the van Leer scheme is more dissipative than
AUSM ($\tilde{\delta} = 0$). It has an additional Mach number scaled
dissipative term which does not vanish even for M=0.
Consequently, the van Leer scheme is more robust but less
accurate than the original scheme of Liou and Steffen,
especially for viscous flow calculations.

The hybrid flux has been introduced in order to ensure
both the clean and sharp shock resolution of the van Leer
scheme and the low-diffusion solution of AUSM in smooth
regions. This is realized by relating the parameter $\omega$ to the
second difference of the pressure,

$$
\omega = \max\left( \nu_{i,j,k},\ \nu_{i+1,j,k} \right), \qquad
\nu_{i,j,k} = \max\left( 1 - \alpha\,
\frac{\left| p_{i-1,j,k} - 2 p_{i,j,k} + p_{i+1,j,k} \right|}{p_{i-1,j,k} + 2 p_{i,j,k} + p_{i+1,j,k}},\ 0 \right), \qquad
\alpha = O(5).
\tag{3.39}
$$

The value of $\omega$ is 1 in smooth regions and switches to 0 in
the vicinity of shocks. Moreover, in order to improve the
damping behavior of the original AUSM ($\tilde{\delta} = 0$) in regions
with adverse grid situations and flow alignment, its dissipative
term has been modified. As can be seen in eq. (3.38),
controlled dissipation is locally introduced for small advection
Mach numbers, preventing the dissipative term from
approaching zero as the Mach number tends to zero. In
Fig. 24 the dissipative term $\Phi$ is plotted as a function of
Mach number. Note that for simplicity $M_L \approx M_R$ is assumed,
which is valid at least in the vicinity of M=0 on a
sufficiently fine computational grid.

Accurate and efficient calculations of viscous flows require
computational grids with high-aspect-ratio cells. Therefore,
the dissipation term of the improved AUSM for small advection
Mach numbers (eq. (3.38)) has to be properly
scaled in order to avoid smearing of the shear layers in
wall-normal direction. As mentioned in [42], this is realized
by defining the parameter $\tilde{\delta}$ in eq. (3.38) not as a constant
but as a function of the cell metric

$$
\tilde{\delta} = \ldots
\tag{3.40}
$$

where $\delta$ is a small constant, $0 < \delta < 0.5$, and $\beta$ is a scaling
function. It may be given by

$$
\beta = \ldots
\tag{3.41}
$$

In the above, $|s^i|$, $|s^j|$, $|s^k|$ represent the surface areas associated
with the i-, j-, k-direction of the body-fitted coordinate
system, respectively. The scaling function $\beta$ in j- and
k-direction is defined in a similar way. With this scaling,
controlled adaptive dissipation can be introduced, which
on the one hand improves the damping behavior of AUSM
in adverse grid situations but on the other hand does not
spoil the accuracy of the method for boundary layer calculations.
It is obvious from eqs. (3.38)-(3.41) that additional
dissipation as a function of the grid aspect ratio is fed in
only along the long sides of the cell, that is, if the cell face
area $|s^i|$ is smaller than the areas $|s^j|$ and $|s^k|$. On the
contrary, if the cell face area $|s^i|$ is larger than the areas $|s^j|$ and
$|s^k|$, as is typical in wall-normal direction, the original
non-smearing dissipation of AUSM is recovered.
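The split Mach numbers and pressures of eqs. (3.32) and (3.35), and the interface values of eqs. (3.31) and (3.34) built from them, can be sketched for a single cell face as follows. The polynomial forms are the standard van Leer splittings [25]; the function names are illustrative assumptions and the left and right states are taken as given.

```python
def split_mach(m):
    """Return the van Leer split Mach numbers (M+, M-) of a state."""
    if m >= 1.0:
        return m, 0.0
    if m <= -1.0:
        return 0.0, m
    return 0.25 * (m + 1.0) ** 2, -0.25 * (m - 1.0) ** 2


def split_pressure(p, m):
    """Return the van Leer split pressures (p+, p-) of a state."""
    if m >= 1.0:
        return p, 0.0
    if m <= -1.0:
        return 0.0, p
    p_plus = 0.25 * p * (m + 1.0) ** 2 * (2.0 - m)
    p_minus = 0.25 * p * (m - 1.0) ** 2 * (2.0 + m)
    return p_plus, p_minus


def face_mach_and_pressure(m_left, p_left, m_right, p_right):
    """Advection Mach number and pressure at the face: M_L+ + M_R-, p_L+ + p_R-."""
    m_plus, _ = split_mach(m_left)
    _, m_minus = split_mach(m_right)
    p_plus, _ = split_pressure(p_left, m_left)
    _, p_minus = split_pressure(p_right, m_right)
    return m_plus + m_minus, p_plus + p_minus


if __name__ == "__main__":
    # Uniform subsonic state: the splitting reproduces M and p.
    print(face_mach_and_pressure(0.5, 2.0, 0.5, 2.0))
    # Supersonic flow to the right: only the left state contributes.
    print(face_mach_and_pressure(2.0, 1.0, 1.5, 3.0))
```

For identical left and right states the two split parts add up to the unsplit value, which is the consistency property one would check first when implementing the splitting.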

An alternate scaling function is given by

$$
\beta = \ldots
\tag{3.42}
$$

This function leads to a constant $\tilde{\delta} = \delta$ along the long side
of the cell, whereas in the wall-normal direction the dissipative
coefficient is weighted by the cell aspect ratio. It is
obvious that along the short cell face the dissipation is reduced
as the cell aspect ratio increases.
Another possibility for the scaling of the adaptive dissipation
is to use the local flow quantities instead of the metric
terms. In this case the function $\beta$ is defined as

$$
\beta = \ldots
\tag{3.43}
$$

where $\lambda^i$, $\lambda^j$, $\lambda^k$ are the spectral radii of the inviscid flux
Jacobians in the i-, j-, k-coordinate direction, respectively.
The scaling function eq. (3.43) also introduces additional
damping in the direction of the long side of the cell. In the
wall-normal direction again only a small amount of dissipation
is allowed.
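Whichever scaling function supplies the threshold of eq. (3.38), the hybrid dissipative term of eqs. (3.36)-(3.39) can be sketched as below. This is a minimal illustration under stated assumptions: the van Leer part adds half the squared split term to |M|, the modified AUSM part replaces |M| by a parabola below the threshold, and the switch takes the larger sensor value of the two nodes adjacent to the face. Function names and the default `alpha = 5.0` are illustrative choices.

```python
def phi_van_leer(m_face, m_left, m_right):
    """Dissipative coefficient that reproduces van Leer's splitting."""
    a = abs(m_face)
    if a >= 1.0:
        return a
    if m_face >= 0.0:
        return a + 0.5 * (m_right - 1.0) ** 2
    return a + 0.5 * (m_left + 1.0) ** 2


def phi_mod_ausm(m_face, delta):
    """|M| with a parabolic floor below the threshold delta (delta = 0: AUSM)."""
    a = abs(m_face)
    if delta <= 0.0 or a > delta:
        return a
    return (a * a + delta * delta) / (2.0 * delta)


def pressure_sensor(p_prev, p_here, p_next, alpha=5.0):
    """Sensor nu: 1 in smooth flow, 0 where the pressure curvature is large."""
    second_difference = abs(p_prev - 2.0 * p_here + p_next)
    return max(1.0 - alpha * second_difference / (p_prev + 2.0 * p_here + p_next), 0.0)


def phi_hybrid(m_face, m_left, m_right, delta, omega):
    """Blend: omega = 0 gives van Leer, omega = 1 gives the modified AUSM."""
    return ((1.0 - omega) * phi_van_leer(m_face, m_left, m_right)
            + omega * phi_mod_ausm(m_face, delta))


if __name__ == "__main__":
    # At M = 0 the original AUSM has no dissipation, van Leer keeps 0.5,
    # and the modified AUSM keeps delta / 2.
    print(phi_hybrid(0.0, 0.0, 0.0, 0.0, 1.0))
    print(phi_hybrid(0.0, 0.0, 0.0, 0.0, 0.0))
    print(phi_hybrid(0.0, 0.0, 0.0, 0.2, 1.0))
```

The three printed values reproduce the ordering discussed in the text: van Leer is the most dissipative at small Mach numbers, the original AUSM has vanishing dissipation there, and the modification restores a small controlled amount.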

The spatial accuracy of the improved flux vector split
scheme depends on the determination of the left and right
state at cell interfaces. For a first-order scheme the flow
quantities at the left and right state are given by their values
at the neighboring mesh points, i.e. (i,j,k) and (i+1,j,k),
respectively (see Fig. 10). Higher-order accuracy is obtained
with the MUSCL approach in the present work. MUSCL
uses extrapolation of flow quantities for the calculation of
the left and right states. With this approach several decisions
must be taken which affect the ability of the scheme
to capture strong shocks and viscous shear layers aligned
with the coordinate grids. These are the choice of the flow
variables to be extrapolated to the cell face and the choice
of the extrapolation function which gives higher-order
fluxes in smooth regions of the flow. At discontinuities the
function switches to first-order accuracy in order to guarantee
shock capturing without spurious oscillations. Here, the
van Albada limiter function is chosen according to [46]


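$$
u^L_{i+\frac12,j,k} = u_{i,j,k} + \frac12\,
\frac{(\Delta_+^2 + \varepsilon)\, \Delta_- + (\Delta_-^2 + \varepsilon)\, \Delta_+}
     {\Delta_+^2 + \Delta_-^2 + 2\varepsilon}
\tag{3.44}
$$

with

$$
\Delta_+ = u_{i+1,j,k} - u_{i,j,k}, \qquad
\Delta_- = u_{i,j,k} - u_{i-1,j,k}
$$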

where $u^L$ denotes the flow quantity u of the left state to be
extrapolated to the face $i+1/2$. The right state, $u^R$, is evaluated
similarly by using the data of points (i,j,k), (i+1,j,k),
(i+2,j,k). This limiter function is equivalent to Fromm's
scheme in smooth regions of the flow where the gradients
squared, $\Delta_+^2$, $\Delta_-^2$, are small compared to $\varepsilon$. In [46] the
quantity $\varepsilon$ is used in order to suppress limiting of the solution
in regions where the flow is nearly constant. This is
accomplished by taking

$$
\varepsilon = \kappa_1\, \Delta x^n
\tag{3.45}
$$

where $\Delta x$ denotes the distance between the grid points (i,j,k)
and (i+1,j,k), $\kappa_1$ is an empirical constant of O(10) and
$2 < n < 3$. Note that one can only expect eq. (3.45) to work
well when solving the flow equations in their nondimensional
form. Eq. (3.45) can be extended to suppress limiting of
the fluxes within boundary layers. Not only does limiting
in the wall-normal direction degrade accuracy on coarse
meshes but it may also introduce spurious oscillations in
the solution, as seen in Fig. 25a. Here, we encounter the
situation that the cartesian velocity components, u and v,
are nonzero but the contravariant velocity component in
wall-normal direction is close to zero. Limiting the extrapolation
of u and v individually, as is standard practice in
most MUSCL implementations [47], may result in false
values for $M_L$ and $M_R$, which define the inherent dissipation
of the split flux (eq. (3.29)). This problem is resolved
by defining


$$
\varepsilon = \max\left( \kappa_1\, \Delta x^n,\ \kappa_2\, \Phi^{modAUSM} \ldots \right)
\tag{3.46}
$$

where $\kappa_2 = O(100)$, $\Phi^{modAUSM}$ is evaluated according to
eq. (3.38) with $\tilde{\delta} = O(0.1)$, and $\bar{M}$ is the average of the
contravariant Mach numbers at points (i,j,k) and (i+1,j,k).
Fig. 25a demonstrates that oscillations in wall-normal direction
are completely removed by using eq. (3.46) instead
of (3.45). Note that this type of oscillation does not occur
in the higher-order results published in [39]. This may be
explained by the fact that the viscous test cases selected in
[39] used cartesian meshes, where the cartesian velocity v is
equal to the corresponding contravariant velocity component.
For this special case eq. (3.45) is sufficient in order to
obtain proper dissipative terms.
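The limited extrapolation of eq. (3.44) with the threshold of eq. (3.45) can be sketched as follows. This is a minimal one-dimensional illustration: the function names and the default values `kappa1 = 10.0` and `n = 2.5` are assumptions picked from the ranges quoted in the text, and the guard against a zero denominator is an addition for the case `eps = 0`.

```python
def limiter_epsilon(dx, kappa1=10.0, n=2.5):
    """Threshold eps = kappa1 * dx**n that switches the limiter off in smooth flow."""
    return kappa1 * dx ** n


def van_albada_left_state(u_prev, u_here, u_next, eps):
    """Left state at face i+1/2 from the values at nodes i-1, i, i+1."""
    d_plus = u_next - u_here
    d_minus = u_here - u_prev
    numerator = (d_plus ** 2 + eps) * d_minus + (d_minus ** 2 + eps) * d_plus
    denominator = d_plus ** 2 + d_minus ** 2 + 2.0 * eps
    if denominator == 0.0:
        return u_here  # constant data and eps = 0: nothing to extrapolate
    return u_here + 0.5 * numerator / denominator


if __name__ == "__main__":
    eps = limiter_epsilon(0.01)
    # Linear data: the extrapolation is exact at the face.
    print(van_albada_left_state(0.0, 1.0, 2.0, eps))
    # Local extremum: the increments cancel and the scheme is first order.
    print(van_albada_left_state(0.0, 1.0, 0.0, eps))
```

When `eps` dominates the squared differences, the increment tends to a quarter of `u_next - u_prev`, which is the unlimited Fromm extrapolation mentioned in the text.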

It should also be mentioned that the second-order interpolant
in eq. (3.44) may be replaced by the third-order formula
of [48]. This alternative yields somewhat more accurate
results for transonic and supersonic flows but it is less
robust for hypersonic flows with strong shocks.

The selection of flow variables for the extrapolation process
is described next. Initially, we tried the standard
choices, namely the use of primitive or conserved flow
variables for extrapolation. It turned out that the latter
choice is not robust at transient shock waves, whereas the
former tends to support oscillations in stagnation point regions
behind strong shocks. Furthermore, neither choice
allows inviscid steady state solutions with constant
total enthalpy. Constant total enthalpy in the steady state
can be obtained if the energy flux in eq. (3.29) is formed
with total enthalpy H being an extrapolated quantity. However,
recalculation of the pressure p in eq. (3.29) from a
single set of flow variables including H does not yield
nonoscillatory fluxes for the momentum equation. Further
numerical experiments showed that extrapolation of the
primitives for mass and momentum fluxes combined with
extrapolation of H in order to compute the energy flux results
in nonoscillatory flow solutions and superior convergence
behavior. This numerical treatment corresponds
closely to the underlying design principle of AUSM, which
splits the flux vector into an advective and a pressure part.
In the computations of 3D hypersonic flow problems, very
strong shocks may occur in regions of strong variations of
the grid metrics. For these cases shock resolution is further
improved by modifying the limiter function, eq. (3.44), as


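$$
u^L_{i+\frac12,j,k} = u_{i,j,k} + \frac{\nu}{2}\,
\frac{(\Delta_+^2 + \varepsilon)\, \Delta_- + (\Delta_-^2 + \varepsilon)\, \Delta_+}
     {\Delta_+^2 + \Delta_-^2 + 2\varepsilon}
\tag{3.47}
$$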


with the pressure switch $\nu$ given by eq. (3.39). Additionally,
the contravariant Mach numbers, $M_L$ and $M_R$, are obtained
by extrapolation of the contravariant velocity component.
More specifically, $M_L$ at cell face $i+1/2$ is
computed by taking

$$
M_L = \frac{(q_n)_L}{c_L}
\tag{3.48}
$$

where $c_L$ denotes the speed of sound associated with the
left state and $(q_n)_L$ is the contravariant velocity, which is
evaluated with the help of eq. (3.47) and

$$
(q_n)_{i,j,k} = q_{i,j,k} \cdot \frac{S_{i+\frac12,j,k}}{\left| S_{i+\frac12,j,k} \right|}
\tag{3.49}
$$

$$
\Delta_+ = \left( q_{i+1,j,k} - q_{i,j,k} \right) \cdot \frac{S_{i+\frac12,j,k}}{\left| S_{i+\frac12,j,k} \right|}
\tag{3.50}
$$

$$
\Delta_- = \left( q_{i,j,k} - q_{i-1,j,k} \right) \cdot \frac{S_{i-\frac12,j,k}}{\left| S_{i-\frac12,j,k} \right|}
\tag{3.51}
$$


Here, $q = [u, v, w]^T$ is the vector of cartesian velocities.
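The projection used in eqs. (3.49)-(3.51) can be sketched as follows, assuming the face vectors are normalized by their magnitude before the dot product is taken. Function and argument names are illustrative. The example shows the situation discussed above: the cartesian velocity changes from node to node, yet the differences normal to a wall-parallel face vanish, so the limiter has nothing to act on.

```python
import math


def unit_normal(s):
    """Normalize a face surface vector (sx, sy, sz)."""
    magnitude = math.sqrt(sum(c * c for c in s))
    return tuple(c / magnitude for c in s)


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def contravariant_differences(q_prev, q_here, q_next, s_minus, s_plus):
    """Return (q_n, delta_plus, delta_minus) for the left state at face i+1/2.

    q_prev, q_here, q_next: cartesian velocities at nodes i-1, i, i+1.
    s_minus, s_plus: surface vectors of faces i-1/2 and i+1/2.
    """
    n_plus = unit_normal(s_plus)
    n_minus = unit_normal(s_minus)
    q_n = dot(q_here, n_plus)
    delta_plus = dot(tuple(a - b for a, b in zip(q_next, q_here)), n_plus)
    delta_minus = dot(tuple(a - b for a, b in zip(q_here, q_prev)), n_minus)
    return q_n, delta_plus, delta_minus


if __name__ == "__main__":
    # Boundary-layer-like profile: u grows away from the wall, faces point in y.
    wall_face = (0.0, 2.0, 0.0)
    print(contravariant_differences((1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
                                    (3.0, 0.0, 0.0), wall_face, wall_face))
```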

In the following, numerical results for inviscid and viscous
flows obtained with the improved advection upstream splitting
method are presented. Emphasis is put on the method's
capability to resolve wall-normal gradients of flow quantities,
which occur for instance in entropy and boundary layers.
As test cases the inviscid flow around a blunt slender
cone and viscous 2D flows are selected.

Inviscid calculations around a blunt slender cone [49] at
freestream Mach number $M_\infty = 8$ and angle of attack $\alpha = 0°$
have been carried out. The curved bow shock detached
from the blunt nose produces a thick entropy layer in the
front part of the configuration which, however, develops into
a very thin layer in the rear part. Since the quality of the
numerical results strongly depends on the resolution of the
entropy layer, computational methods have to be used
which accurately predict this flow feature.

The grid used for the calculations is shown in Fig. 26. The
C-O topology has been chosen with 161x41x31 grid points
in i-, j-, k-direction, respectively [50]. 21 grid points were
used to discretize the spherical nose shape in streamwise
direction. In i- and j-direction a linear stretching of the grid
spacing was introduced. This allows a suitable grid distribution
with respect to computational efficiency. The
stretching in j-direction provides enough grid points in the
near-wall region to resolve the thin entropy layer
in the rear part of the configuration. Fig. 27 shows Mach
number and pressure contours in the nose region obtained
with the improved flux splitting method. The flow field is
axisymmetric since the angle of attack has been set to
zero. In order to check the accuracy of the scheme, in
Fig. 28 the entropy value at the wall is plotted along the body
in streamwise direction. Since for inviscid flows the body
surface is part of the stagnation streamline, the entropy is
constant along the body. Its value is determined by
the entropy rise across the normal shock. In Fig. 28 numerical
results obtained with the improved AUSM and
with the classical van Leer scheme are depicted. In addition,
the analytical entropy value at the wall is given. In the
front part of the configuration (almost up to 100 nose radii)
the error of AUSM is less than 1%. In the rear part, however,
the accuracy decreases. This may be attributed to
the computational grid, which in this part of the configuration
is not sufficiently fine to resolve the thin entropy layer
as accurately as in the front part. It should be noted that for
this calculation the scaling function eq. (3.42) with $\delta = 0.1$
has been used to control the dissipative term. Computations
with the other scaling functions or with different
parameters $\delta$ did not improve the results. As can be seen in
Fig. 28, the classical van Leer scheme produces less accurate
results along the whole configuration. This demonstrates
that on a given grid the improved flux splitting
method is less diffusive than the van Leer scheme
and therefore better suited to the accurate resolution of
entropy layers.
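The analytical reference value mentioned above follows from the Rankine-Hugoniot relations across a normal shock. A minimal sketch for a calorically perfect gas is given below; the ratio of specific heats `gamma = 1.4` is an assumption, since the gas model is not stated in this passage, and the function name is illustrative.

```python
import math


def normal_shock_entropy_rise(mach, gamma=1.4):
    """Entropy rise (s2 - s1) / R across a normal shock, perfect gas."""
    m2 = mach * mach
    pressure_ratio = 1.0 + 2.0 * gamma / (gamma + 1.0) * (m2 - 1.0)
    density_ratio = (gamma + 1.0) * m2 / ((gamma - 1.0) * m2 + 2.0)
    # s2 - s1 = c_v * ln[(p2/p1) * (rho1/rho2)**gamma], with c_v = R / (gamma - 1)
    return (math.log(pressure_ratio) - gamma * math.log(density_ratio)) / (gamma - 1.0)


if __name__ == "__main__":
    # Freestream Mach number of the blunt cone test case.
    print(normal_shock_entropy_rise(8.0))
```

Because the wall streamline has passed through the normal part of the bow shock, this single number is the value the computed wall entropy should reproduce along the entire body.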

Several  two-dimensional  viscous  flow  problems  serve  to 
demonstrate  the  ability  of  the  new  flux  vector  split  scheme 
to  resolve  viscous  shear  layers.  We  have  chosen  transonic 
and  hypersonic  test  cases  which  are  well  known  from  liter¬ 
ature. 

The  first  test  case  is  the  transonic  turbulent  flow  over  the 
RAE  2822  airfoil  at  M„=0.73,  ot=2.79°,  Re=65xl0^  The 
computational  grid  consists  of  320x64  cells.  Flow  compu¬ 
tations  were  carried  out  with  explicit  multi-stage  time  step¬ 
ping  and  multigrid  with  full  coarsening.  A  typical  conver¬ 
gence  history  is  displayed  in  Fig.  29.  Computing  time  was 
reduced by full multigrid, that is, coarse-mesh solutions on
grids with 80x16 cells and 160x32 cells were each obtained with
100 multigrid iterations in order to produce the initial
solution  on  the  next  finer  grid.  An  impression  of  the  overall 
flow  field  is  provided  by  Fig.  25b.  The  improved  AUSM 
yields  a  clean  resolution  of  the  shock  and  the  boundary  lay¬ 
ers.  Fig.  30  compares  the  distributions  of  skin  friction 
yielded  by  AUSM  and  van  Leer  scheme  under  grid  refine¬ 
ment.  There  is  a  dramatic  improvement  of  resolution  visi¬ 
ble  for  the  improved  AUSM.  Not  only  does  the  improved 
resolution  of  shear  layers  affect  friction  drag  of  the  airfoil 
but also the pressure forces due to viscous/inviscid interaction.
This is demonstrated in Fig. 31, where lift and drag
values  are  plotted  as  a  function  of  the  inverse  of  the  total 
number  of  cells,  N.  The  results  of  the  high-resolution 
upwind  TVD  scheme  are  included  for  comparison.  The 
smeared boundary layers of van Leer's scheme affect the
interaction  with  the  shock  in  that  the  shock  location  moves 
upstream  (not  shown  here).  Consequently,  lift  is  underpre¬ 
dicted  by  van  Leer's  scheme  as  compared  to  AUSM  and 
Upwind  TVD.  The  improved  AUSM  is  best  for  the  predic¬ 
tion  of  pressure  drag  whereas  AUSM  and  Upwind  TVD  do 
similarly  well  for  skin  friction  drag.  The  relatively  large 
values  of  pressure  drag  for  the  upwind  schemes  on  coarse 
meshes  as  compared  to  those  for  the  central  differencing 
plus  matrix-valued  dissipation  given  in  Fig.  12  are  caused 
by  the  effect  of  the  flux  limiter  in  the  nose  region  of  the  air¬ 
foil.  This  effect  disappears  for  subsonic  cases  when  the 
flux limiting is switched off. The construction of a limiter
function which is only active at shocks and does not adversely
affect smooth flow regions is still an unresolved problem.

The next viscous 2D test case presented here is the hypersonic
laminar flow past a 15° compression ramp. The onflow
conditions correspond to case III.4 of the Workshop
on Hypersonic Flows for Reentry Problems held in Antibes,
1991 [31]. The grid consists of 288x224 cells. In
Fig.  32  the  Mach  contours  are  plotted.  Results  obtained 
with  the  second-order  TVD  scheme  and  the  second-order 
improved AUSM with scaling eq. (3.42) and δ = 0.05 are
presented. No major differences are visible between the
results of the different schemes. This statement is
supported by the plots of the pressure coefficient, the skin
friction coefficient, and the Stanton number along the wall in
Fig.  33.  Only  slight  differences  occur  in  the  skin  friction 
coefficient  and  the  Stanton  number.  As  in  the  previous  test 
case,  the  scaling  of  the  dissipative  term  in  the  modified 
AUSM  has  no  influence  on  the  result  on  this  very  fine  grid. 
These calculations demonstrate that the improved flux vector
split method predicts viscous flows as accurately as the
TVD flux difference splitting scheme. For the viscous test
cases presented here almost no differences in the results
have been observed for the different scaling functions
which have been proposed for a proper scaling of the dissipative
term. Compared to the TVD scheme the convergence
behavior of the modified AUSM scheme is slightly
worse.  However,  due  to  the  reduced  computational  effort 
per  time  step,  the  overall  efficiency  of  both  methods  is 
comparable.  Since  in  contrast  to  the  TVD  scheme  the  nu¬ 
merical  effort  of  AUSM  is  proportional  to  the  number  of 
unknowns,  substantial  reduction  of  the  computational  cost 
can  be  expected  for  3D  calculations  and  also  for  solutions 
of  flow  problems  with  additional  conservation  equations. 
Computations  of  complex  3D  viscous  flows  over  a  winged 
reentry  vehicle  including  deflected  control  surfaces  and 
multiblock  computations  of  the  flow  through  the  slot  be¬ 
tween  different  control  surfaces  (see  chapter  5)  demon¬ 
strated  the  usefulness  of  the  present  discretization  for  gen¬ 
eral  3-D  applications.  AUSM  enables  us  to  compute  flows 
with  very  strong  shocks  and  strong  expansions  into  leeside 
flow  regions,  which  were  impossible  with  flux  difference 
split  methods. 


3.6  Viscous  Terms 

For  the  approximation  of  the  Navier-Stokes  equations  all 
schemes  presented  in  the  previous  sections  rely  on  the 
same  central  discretization  of  the  viscous  terms.  The  vis¬ 
cous  fluxes  required  to  determine  the  solution  at  point 
(i,j,k)  are  approximated  using  the  auxiliary  cell  shown  in 
Fig.  3.10.  They  contain  first  derivatives  of  the  flow  vari¬ 
ables,  which  are  computed  using  a  local  transformation 
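from Cartesian coordinates to the curvilinear coordinates
ξ, η, ζ [14]. For an arbitrary flow quantity φ one obtains

$$\frac{\partial \phi}{\partial x} = \frac{\partial \phi}{\partial \xi}\frac{\partial \xi}{\partial x} + \frac{\partial \phi}{\partial \eta}\frac{\partial \eta}{\partial x} + \frac{\partial \phi}{\partial \zeta}\frac{\partial \zeta}{\partial x} .$$    (3.52)

The derivatives φ_ξ, φ_η and φ_ζ are evaluated employing
central finite differences, whereas the cell face vectors and
the volume are used to compute the metric derivatives.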
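
The chain rule of eq. (3.52) can be illustrated with a small sketch. It is restricted to two dimensions and, as an assumption made only for brevity, it evaluates the metric derivatives with central differences of the grid coordinates instead of the cell face vectors and volumes used in the text. The function names `ddx_curvilinear` and `wavy_grid` are hypothetical.

```python
import numpy as np

def ddx_curvilinear(phi, x, y):
    """Approximate d(phi)/dx on a structured 2D grid via the chain rule."""
    # Central differences in index space (xi along axis 0, eta along axis 1).
    phi_xi, phi_eta = np.gradient(phi)
    x_xi, x_eta = np.gradient(x)
    y_xi, y_eta = np.gradient(y)
    # Jacobian of the mapping (xi, eta) -> (x, y).
    jac = x_xi * y_eta - x_eta * y_xi
    # Metric derivatives from the inverse of the mapping Jacobian.
    xi_x = y_eta / jac
    eta_x = -y_xi / jac
    # Two-dimensional analogue of eq. (3.52).
    return phi_xi * xi_x + phi_eta * eta_x

def wavy_grid(n=41):
    """Smooth, non-orthogonal test grid (illustrative only)."""
    xi, eta = np.meshgrid(np.linspace(0.0, 1.0, n),
                          np.linspace(0.0, 1.0, n), indexing="ij")
    x = xi + 0.05 * np.sin(np.pi * eta)
    y = eta + 0.05 * np.sin(np.pi * xi)
    return x, y
```

For a smooth field such as φ = x² + y on this grid, the interior values returned by `ddx_curvilinear` should approach 2x as the grid is refined, because both the flow derivatives and the metrics are second-order accurate there.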

In high Reynolds number flows with thin viscous shear
layers the flow gradients in the direction normal to the wall
are  much  larger  than  those  along  the  wall.  This  fact  allows 
a  simplified  approximation  of  the  viscous  terms,  called  thin 
layer  approximation.  Using  a  body  fitted  mesh,  there  is  one 
family  of  grid  lines  almost  parallel  to  the  wall  and  another 
one  approximately  normal  to  it.  If  the  thin  layer  is  to  be  re¬ 
solved  accurately  and  if  the  number  of  points  is  to  be  kept 
within a limit which is tolerable to today's supercomputers,
highly stretched grids in wall-normal direction are used.
On such grids one cannot expect the viscous terms in streamwise
direction to be resolved accurately. Therefore, with
the thin layer approximation all the viscous contributions
arising from gradients in the direction of the quasi-streamwise
coordinates are neglected. In all viscous applications
shown in this paper the thin layer approximation has been
used.

4. EFFICIENT ALGORITHMS FOR THE
COMPUTATION OF STEADY-STATE SOLUTIONS

As  numerical  flow  simulations  pave  their  way  into  the 
practical  aerodynamic  design  process,  the  need  for  efficient 
methods  to  solve  the  equations  governing  inviscid  and  vis¬ 
cous  flows  has  become  very  obvious.  Many  solvers  still 
used  in  current  aerospace  development  programs  exhibit 
slow  convergence  towards  the  desired  steady-state  solu¬ 
tions,  which  leads  to  high  computer  costs  and  long  turn¬ 
around  times.  Consequently,  there  is  a  substantial  amount 
of research work focused on methods for convergence acceleration.
One of the promising approaches is the multigrid
method. Multigrid, which uses a sequence of successively
coarser meshes in order to propagate and damp
disturbances throughout the flow field, was initially invented
and analyzed for the solution of elliptic partial differential
equations by A. Brandt [51]. Later, the idea was
successfully  applied  to  purely  hyperbolic  or  mixed  systems 
of  equations  in  fluid  mechanics,  even  though  the  mathe¬ 
matical  backing  of  these  extensions  is  still  incomplete. 

4.1  Multigrid  Approach 

To  set  the  stage  for  the  discussions  of  multigrid  in  subse¬ 
quent  parts  of  the  chapter  we  first  describe  the  multigrid 
method  and  some  means  to  analyze  its  performance. 

4.1.1  Definition  of  Multigrid  Components 

The  multigrid  method  deals  with  a  sequence  of  meshes 
which  differ  by  their  density  of  grid  points.  They  may  be 
created  by  successively  deleting  every  second  grid  line  in 
all  coordinate  directions.  By  this,  4-7  coarse  meshes  are 
generated for practical flow problems. Here, we will describe
an arrangement of a fine mesh with index f and a
coarse mesh with index c. The semidiscretization of chapter 3
on the fine mesh can be written as

$$\frac{d W_f}{d t} = -\frac{1}{V_f} R_f(W_f) ,$$    (4.1)

where  Wf  is  the  solution  vector,  Rf  represents  the  discrete 
flux  balance,  and  Vf  is  the  discrete  volume  around  the  grid 
point.  The  fine-mesh  solution,  Wf,  may  be  improved  by 
numerically  advancing  eq.  (4.1)  in  time,  which  is  called 
smoothing  in  multigrid  terminology.  Practical  smoothing 
schemes  based  on  explicit  and  implicit  time  stepping  are 
discussed  in  chapter  4.2.  In  order  to  improve  the  solution 
on  the  fine  grid  with  the  aid  of  a  coarse  grid,  a  series  of 
steps  are  carried  out  as  follows. 

Both  the  solution  vector  and  the  residual  vector  are  trans¬ 
ferred  to  the  coarse  mesh.  Simple  injection 


at  the  coincident  grid  point  is  used  for  transfer  of  the  solu¬ 
tion.  In  order  to  ensure  conservation  property  for  the  resid¬ 
ual  transfer,  full  weighting  according  to  [51]  is  applied  as 

$$I_f^c R_f = 8\, \mu_x^2 \mu_y^2 \mu_z^2 R_f ,$$    (4.3)

where μ_x, μ_y, and μ_z are the standard averaging operators in
curvilinear coordinate directions. Note that the transferred
residual, R_f, should be based on the most recent solution,
W_f, in order to obtain best efficiency of the overall
method. The restricted residual is used to define a forcing
function for the coarse mesh

$$P_c = I_f^c R_f - R_c(W_c^{(0)})$$    (4.4)

as the difference between the restricted residual and the
coarse-grid residual calculated with the injected solution, W_c^{(0)}.
The use of the forcing function eq. (4.4) is necessary if we
want to solve eq. (4.1) on the coarse mesh in order to obtain
corrections for the solution on the fine mesh. The
smoothing scheme is then used to solve

$$\frac{d W_c}{d t} = -\frac{1}{V_c} \left[ R_c(W_c) + P_c \right] .$$    (4.5)

Note that during the first numerical update of eq. (4.5) on
the coarse mesh the coarse-grid residual R_c drops out. This
ensures zero corrections from the coarse mesh if the restricted
residual from the fine mesh, I_f^c R_f, vanishes in the
steady state.

Execution of one or several time steps on the coarse mesh
yields corrections of the form

$$\Delta W_c = W_c^{(k)} - W_c^{(0)} ,$$    (4.6)

where the superscripts denote the discrete time level. The
correction is then transferred to the fine grid, which is called
prolongation. The prolongation operator is denoted by I_c^f
and it contains linear interpolation for most of the results
presented in the present lecture (see also chapter 4.3). The
total correction on the fine mesh after n time steps on the
fine mesh and k time steps on the coarse mesh is

$$\Delta W_f = W_f^{(n)} - W_f^{(0)} + I_c^f \Delta W_c .$$    (4.7)
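
The sequence of steps described in this section (smoothing, injection of the solution, full weighting of the residual, forcing function, coarse-mesh smoothing, prolongation of the correction) may be sketched for a simple model. The sketch below is an illustration under stated assumptions, not the flow solver of the lecture: it uses a 1D diffusion-type residual with zero boundary values, a single explicit pseudo-time step as smoother, and a residual defined per unit volume, so that the full-weighting weights (1/4, 1/2, 1/4) sum to one. All function names and parameter values are hypothetical.

```python
import numpy as np

def residual(w, f, h):
    """Model residual R(w) = -w'' - f at interior points (zero at the ends)."""
    r = np.zeros_like(w)
    r[1:-1] = (2.0 * w[1:-1] - w[:-2] - w[2:]) / h**2 - f[1:-1]
    return r

def smooth(w, f, h, forcing, steps):
    """Explicit pseudo-time steps on dw/dt = -(R(w) + P), cf. eqs. (4.1), (4.5)."""
    dt = h**2 / 3.0  # damps the highest frequencies by a factor 1/3 per step
    for _ in range(steps):
        w = w - dt * (residual(w, f, h) + forcing)
    return w

def restrict_full_weighting(r):
    """Full weighting of the residual onto every second grid point."""
    rc = r[::2].copy()
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def prolong_linear(ec, n_fine):
    """Linear interpolation of a coarse-mesh correction to the fine mesh."""
    e = np.zeros(n_fine)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return e

def two_grid_cycle(wf, ff, hf, n_pre=2, n_coarse=50):
    """One two-grid cycle following the steps of section 4.1.1."""
    wf = smooth(wf, ff, hf, 0.0, n_pre)             # smoothing on the fine mesh
    rf = residual(wf, ff, hf)                       # based on most recent solution
    wc0 = wf[::2].copy()                            # injection of the solution
    hc, fc = 2.0 * hf, ff[::2]
    pc = restrict_full_weighting(rf) - residual(wc0, fc, hc)  # forcing, eq. (4.4)
    wc = smooth(wc0, fc, hc, pc, n_coarse)          # coarse-mesh problem, eq. (4.5)
    return wf + prolong_linear(wc - wc0, wf.size)   # prolongated correction
```

Two properties discussed above can be checked with this sketch: a converged fine-mesh solution receives no correction from the coarse mesh, and for smooth errors the cycle reduces the error faster than the same number of time steps spent on the fine mesh alone.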


4.1.2  Analysis  of  Model  Problem 

Von Neumann analysis of a model problem is carried out in
order to study the numerical behavior of the multigrid components
defined above. Until now, we have used a 2D scalar
model of the type

$$\frac{\partial w}{\partial t} + a \frac{\partial w}{\partial x} + b \frac{\partial w}{\partial y} = c \frac{\partial^2 w}{\partial y^2} .$$    (4.8)
With appropriate choices of a, b, and c, the model allows one to
investigate the properties of multidimensional convection
dominated problems and also cases where diffusion takes
over.

If one uses uniform spacings, Δx and Δy, for discretization,
one can also study the effect of high aspect ratio cells,

$$\Delta x \gg \Delta y$$    (4.9)

and convection aligned with the grid,

$$a \Delta y \gg b \Delta x .$$    (4.10)

The  scalar  model  does  not  allow  analysis  in  situations 
where  the  eigenvalues  of  the  inviscid  flux  Jacobians  of  the 
system  of  flow  equations  differ  due  to  large  differences  in 
the  acoustic  and  convective  wave  speeds.  These  differ¬ 
ences  are  typical  features  of  low-speed  flow  regions  and 
also  near  sonic  lines.  The  interested  reader  is  referred  to 
Refs.  [52-54]  for  more  details  about  these  problems. 

We apply semidiscretization for the spatial derivatives on a
domain, Ω, which is covered with the fine mesh containing
cells with spacings Δx_f and Δy_f and the volume, V_f =
Δx_f Δy_f. Defining a time step on the fine mesh, for example,

$$\Delta t_f = \frac{V_f}{a \Delta y_f + b \Delta x_f} ,$$    (4.11)

the discrete approximation of eq. (4.8) at point (i,j) reads

$$\Delta t_f \frac{d w_{i,j}}{d t} = -\frac{\Delta t_f}{V_f} \left[ a \Delta y_f D_x + b \Delta x_f D_y - c \frac{\Delta x_f}{\Delta y_f} D_{yy} \right] w_{i,j} .$$    (4.12)

D_x, D_y and D_yy denote the difference operators used to approximate
the first and second derivatives of eq. (4.8), respectively.
Suppose we want to investigate first-order
upwind differencing for the convective terms. Then, we
obtain

$$D_x w_{i,j} = w_{i,j} - w_{i-1,j} , \quad D_y w_{i,j} = w_{i,j} - w_{i,j-1} , \quad D_{yy} w_{i,j} = w_{i,j+1} - 2 w_{i,j} + w_{i,j-1}$$    (4.13)

for a > 0 and b > 0. Difference operators for higher-order
discretization may be found in Refs. [55-56].

Assuming a periodic boundary condition, the scalar function,
w(x,y,t), can be expressed by a Fourier series

$$w_{i,j}(t) = \sum_{\Theta_x, \Theta_y} \hat{w}_{\Theta_x \Theta_y}(t)\, e^{I (i \Theta_x + j \Theta_y)} ,$$    (4.14)

where I denotes the imaginary unit and the Fourier angles,
Θ_x, Θ_y, vary between -π and π.

In the Von Neumann analysis the behavior of a single mode

$$w_{i,j} = \hat{w}(t)\, e^{I (i \Theta_x + j \Theta_y)}$$    (4.15)

is studied and the complete result is obtained by linear superposition.

Inserting eq. (4.15) and (4.13) into eq. (4.12) one obtains
the growth of the amplitude of the Fourier mode,

$$\Delta t_f \frac{d \hat{w}}{d t} = -Z \hat{w} ,$$    (4.16)

$$Z = \frac{\Delta t_f}{V_f} \left[ a \Delta y_f \left( I \sin\Theta_x + (1 - \cos\Theta_x) \right) + b \Delta x_f \left( I \sin\Theta_y + (1 - \cos\Theta_y) \right) + 2 c \frac{\Delta x_f}{\Delta y_f} (1 - \cos\Theta_y) \right] .$$

If the Fourier symbol of a time stepping operator used to
solve eq. (4.16) is denoted by F, one can write eq. (4.16) as

$$\delta \hat{w} = -F Z \hat{w}^n ,$$    (4.17)

so that the amplification factor of the Fourier mode becomes

$$g = 1 - F Z .$$

The Fourier symbols of some selected time stepping
schemes used as a smoother for multigrid algorithms are
discussed in section 4.2. Any time-stepping scheme to
solve the semidiscrete equation (4.12) is linearly stable if
the Fourier mode does not grow in time, that is

$$|g| \le 1 .$$

4.1.3  Multigrid  Analysis 

If a multigrid algorithm is used to solve the semidiscrete equation
(4.12), the resulting iteration operator becomes a matrix
according to Hackbusch [57]. Accordingly, the Fourier
transform for analysis of an iteration with multigrid is a
matrix with the dimension 2^{l-1} x 2^{l-1}, where l denotes the
number  of  grid  levels  involved.  Analysis  of  this  type  for 
fluid  mechanics  has  been  published  by  Mulder  [58], 
Leclercq  [59],  and  Eliasson  [60]. 

As  an  alternative,  Jameson  [61]  has  presented  a  so-called 
uniform analysis which simplifies the Fourier transform of
the matrix to a scalar. With the multilevel uniform analysis,
fine-grid  and  coarse-grid  corrections  are  formally  com¬ 
puted  at  all  points  of  the  fine  grid.  Then  a  nonlinear  filter  is 
applied  to  remove  the  coarse-grid  corrections  at  fine-grid 
points  not  contained  in  the  coarse-grid.  The  filtering  intro¬ 
duces  errors  in  the  analysis  for  the  grid  points  not  con¬ 
tained  on  the  coarse  grid,  that  is,  it  does  not  allow  for  the 
coupling  effects  due  to  the  interpolation  operator  in  the 
multigrid  method.  However,  it  does  offer  the  advantages  of 
simplicity  and  easy  application  to  more  than  two-level 
schemes.  Thus,  it  allows  the  rapid  comparison  of  different 
multigrid  algorithms.  If  a  multigrid  method  is  unstable  or 
inefficient  according  to  this  analysis,  then  it  is  certainly  not 
a  reasonable  scheme. 

In  order  to  apply  the  uniform  scheme  analysis  one  needs 
the Fourier symbols of the multigrid components. The Fourier
symbol of the injection operator from eq. (4.2) is simply
1. The weighted residual transfer operator in 2D,

$$I_f^c = 4\, \mu_x^2 \mu_y^2 ,$$

has a Fourier transform,

$$\hat{I}_f^c = (1 + \cos\Theta_x)(1 + \cos\Theta_y) .$$    (4.20)

As  for  prolongation,  we  consider  only  the  mesh  points 
which  are  contained  in  the  coarse  mesh  and  the  fine  mesh. 
Hence, the Fourier transform of prolongation is simply 1.
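
The Fourier transform of eq. (4.20) can be verified numerically by applying a restriction stencil to a single Fourier mode. The sketch below assumes the familiar nine-point full-weighting stencil in 2D whose weights sum to 4 (the coarse cell holding four fine cells); this stencil and the names `STENCIL` and `restriction_symbol` are assumptions made for illustration.

```python
import numpy as np

# Assumed nine-point full-weighting stencil in 2D; the weights sum to 4.
STENCIL = np.array([[0.25, 0.5, 0.25],
                    [0.50, 1.0, 0.50],
                    [0.25, 0.5, 0.25]])

def restriction_symbol(theta_x, theta_y):
    """Apply the stencil to exp(I*(i*theta_x + j*theta_y)) and evaluate at i = j = 0."""
    total = 0.0 + 0.0j
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            mode = np.exp(1j * (di * theta_x + dj * theta_y))
            total += STENCIL[di + 1, dj + 1] * mode
    return total
```

The returned value is real up to rounding and agrees with (1 + cos Θ_x)(1 + cos Θ_y): it equals 4 for the constant mode and vanishes for the highest frequency in either direction.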

4.2  Smoothing  Schemes 

This  section  discusses  two  selected  schemes  to  iterate  the 
semidiscrete  equation 


towards  its  desired  steady  state  solution.  The  chosen  ex¬ 
plicit  and  implicit  schemes  are  characterized  by  their  low 
operation  count  and  storage  requirements.  The  analysis  for 
1D and 2D scalar model problems indicates good damping
properties  of  these  schemes  for  high-frequency  compo¬ 
nents  of  transient  errors  whereas  the  long  waves  which  oc¬ 
cur  on  fine  coordinate  meshes  are  slowly  damped.  There¬ 
fore,  these  schemes  may  be  taken  as  smoother  for  a 
multigrid  method. 


4.2.1 Explicit Multistage Schemes

Explicit multistage schemes are considered here for several
reasons. They are simple to program, in particular at
boundaries, and for multiblock partitioned computational
domains. Moreover, the number of stages and their coefficients
can be varied in order to optimize both damping and
convection of transient disturbances [61-62]. Finally,
explicit schemes usually do not require start-up procedures.
The most simple multistage scheme with p stages
reads

$$w^{(0)} = w^n ,$$
$$w^{(l)} = w^{(0)} - \alpha_l \frac{\Delta t}{V} R(w^{(l-1)}) , \quad l = 1, 2, \ldots, p ,$$    (4.21)
$$w^{n+1} = w^{(p)} .$$

One can always represent the change of the Fourier mode,
ŵ, according to eq. (4.16) by substitution of eq. (4.14) into
(4.21). This yields the amplification rate, g, as function of
the Fourier angles Θ_x, Θ_y. Fig. 34 presents results of a 3-stage
scheme and first-order upwind spatial discretization
for a 1D convection problem taken from Ref. [62]. High-frequency
error modes for π/2 < |Θ_x| < π are well damped
whereas the damping for lower frequencies is poor. The
Courant number of this scheme is limited to about 1.5. This
indicates that transient errors in the solution are convected
out of the computational domain at a relatively low rate per
time step.

In eq. (4.21) we have assumed that both the central (convective)
and dissipative parts of the spatial discretization
operator, Z, are evaluated on each stage of the time stepping
scheme. Somewhat more freedom in the design of
multistage schemes is gained by evaluating the dissipative
parts only at q out of p total stages. Moreover, the dissipation
evaluations may be weighted. If one defines C and D
as the centrally discretized and dissipative part of the flux
approximation, the residual function of weighted multistage
schemes is defined as

$$R^{(l+1)} = C(w^{(l)}) - \sum_{m=0}^{l} \gamma_{lm} D(w^{(m)}) , \qquad \sum_{m=0}^{l} \gamma_{lm} = 1 .$$    (4.22)

One can extend the stability range of the explicit multistage
schemes by a simple scalar implicit operator acting on the
residuals. For two dimensions, this residual smoothing can
be applied in the factored form

$$(1 - \beta_x \nabla_x \Delta_x)(1 - \beta_y \nabla_y \Delta_y)\, \bar{\mathcal{R}}^{(l)} = \mathcal{R}^{(l)} ,$$    (4.23)

where the residual ℛ^{(l)} is defined as

$$\mathcal{R}^{(l)} = \frac{\Delta t}{V} R^{(l)} .$$    (4.24)

∇ and Δ are the normal forward and backward difference
operators. The smoothing coefficients, β_x and β_y, depend
on the Courant numbers in the individual coordinate directions
according to Refs. [63,64]. This implicit procedure
allows  the  explicit  stability  limit  to  be  increased  by  a  factor 
of  2  to  3.  The  performance  of  a  particular  5-stage  scheme 
which  was  optimized  by  Tai  [65]  using  weighted  residual 
evaluation  and  implicit  residual  smoothing  is  displayed  in 
Fig.  35.  The  scheme  damps  disturbances  much  better  for 
the long-wave range, as compared to the 3-stage scheme of
Fig. 34. Note that this scheme requires only 3 evaluations
of dissipative flux terms which usually take the majority of
floating point operations. Contours of constant amplification
factor over the Fourier angles in x and y directions for
the 2D convection problem with a=b=1 are shown in
Fig. 36. Results of the uniform multigrid analysis briefly
described in section 4.1 are also included. It is seen that the
multigrid scheme acts to improve damping at lower frequencies.
There is enough stability margin of the scheme
for Fourier angles |Θ| > π/4, which indicates that the errors
contained in the uniform analysis can be tolerated.
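
The damping behavior of a multistage scheme can be examined with a few lines of code. The following sketch evaluates the amplification factor of a p-stage scheme of the simple form discussed in this section for 1D convection with first-order upwind differencing. The stage coefficients (0.1481, 0.4, 1.0) and the Courant number 1.5 are assumptions chosen for illustration; the text does not list the coefficients behind Fig. 34. The function names are hypothetical.

```python
import numpy as np

def upwind_symbol(theta, cfl):
    """Fourier symbol Z of first-order upwind convection in 1D, scaled by the Courant number."""
    return cfl * (1.0 - np.exp(-1j * theta))

def multistage_amplification(z, alphas):
    """Amplification factor g of w^(l) = w^(0) - alpha_l * Z * w^(l-1), l = 1..p."""
    g = np.ones_like(z)
    for alpha in alphas:
        g = 1.0 - alpha * z * g
    return g

# Assumed 3-stage coefficients and Courant number (illustrative values).
ALPHAS = (0.1481, 0.4, 1.0)
CFL = 1.5

theta = np.linspace(-np.pi, np.pi, 721)
g = multistage_amplification(upwind_symbol(theta, CFL), ALPHAS)
high = np.abs(theta) >= np.pi / 2
low = (np.abs(theta) <= np.pi / 8) & (np.abs(theta) > 0.0)
```

With these values the magnitude of g stays small over the high-frequency range and remains close to one for long waves, which is the qualitative behavior described for the 3-stage scheme above.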

4.2.2  Implicit  LU-SSOR  Scheme 
Multigrid  methods  based  on  explicit  multi-stage  schemes 
have  been  shown  to  yield  good  convergence  rates  for  both 
inviscid  and  viscous  flows.  As  seen  in  chapter  4.2.1  the 
principal  reason  for  this  is,  that  the  number  of  stages  and 
the  stage  coefficients  can  be  tuned  such  that  good  high  fre¬ 
quency  damping  is  obtained  which  is  necessary  for  an  effi¬ 
cient  multigrid  process.  However,  for  flow  problems  which 
are  governed  by  equations  with  strong  source  terms,  as  for 
example  viscous  flows  for  which  turbulence  viscosity  is 
determined by multi-equation turbulence models and hypersonic
non-equilibrium flows, a severe time-step restriction
is imposed on explicit schemes. This leads to slow
convergence,  even  if  a  multigrid  method  is  used.  In  order 
to  overcome  the  time-step  restriction,  some  kind  of  implicit 
operator  has  to  be  used.  Various  approaches  are  known  in 
the  literature.  Preferable  techniques  are  the  point-implicit 
treatment  of  source  terms  or  the  full  implicit  treatment  of 
all  equations.  Thus  there  is  an  urgent  need  to  develop  im¬ 
plicit  multigrid  methods. 

In  the  past  efficient  multigrid  methods  have  been  devel¬ 
oped  in  conjunction  with  implicit  schemes  (e.g.  [66-70]). 
Various  implicit  operators  have  been  used  as  a  multigrid 
driver  including  factored  and  unfactored  schemes.  This  re¬ 
port  focuses  on  the  investigation  of  the  damping  properties, 
convergence  behavior  and  stability  of  the  implicit  LU- 
SSOR  scheme  in  the  framework  of  a  standard  multigrid 
method. 

The  LU-SSOR  scheme  (Lower-Upper  Symmetric  Succes¬ 
sive  Overrelaxation)  became  quite  popular  because  of  its 
low  numerical  effort,  efficient  implementation  on  vector 
computers  and  reasonable  convergence  speed.  The  algo¬ 
rithm  belongs  to  the  class  of  factored  schemes  and  is  based 
on the decomposition of the full implicit operator into
lower and upper triangular matrices. The LU-SSOR
scheme  originally  introduced  by  Yoon  and  Jameson 
[69,71]  and  further  developed  by  Rieger  and  Jameson  [72] 
and  Yoon  and  Kwak  [73]  combines  the  advantages  of  the 
LU factorization and the symmetric Gauss-Seidel relaxation.
Recently, Yoon et al. [74] and Blazek [70,75] have
used  the  LU-SSOR  scheme  as  an  effective  smoother  for  an 
efficient  multigrid  scheme.  They  have  shown  fast  conver¬ 
gence  for  many  inviscid  and  viscous  flow  problems  includ¬ 
ing  high  speed  flows. 

In  the  following  the  LU-SSOR  scheme  is  briefly  presented. 
Details are given in [75]. In general an implicit scheme for
the system of ordinary differential equations (4.1) can be
formulated as

$$\frac{\Delta W}{\Delta t} = -\left( \beta R^{n+1} + (1 - \beta) R^n \right)$$    (4.25)

with the solution correction

$$\Delta W = W^{n+1} - W^n$$    (4.26)

and  the  discrete  flux  balance 


R  = 


(4.27) 


where R^c and R^v denote the convective and viscous part,
respectively. The parameter β (1/2 ≤ β ≤ 1) determines the accuracy
of the implicit scheme in time. For β = 1/2 the
scheme is second-order accurate, while for all other values
the time accuracy drops to first order.

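Linearizing eq. (4.25) by

$$R^{n+1} = R^n + \frac{\partial R}{\partial W} \Delta W + O(\Delta t^2)$$    (4.28)

and dropping all terms of second and higher order, one
obtains the general unfactored implicit scheme

$$\left[ I + \beta \Delta t \frac{\partial R}{\partial W} \right] \Delta W = -\Delta t\, R^n .$$    (4.29)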
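
The behavior of the unfactored implicit scheme can be made concrete for a single scalar unknown. The sketch below applies one step of the linearized form to a scalar residual function; for β = 1 and a very large time step it approaches Newton's method for R(w) = 0. The function names and the test residual R(w) = w³ - 8 are hypothetical and serve only as an illustration.

```python
def implicit_step(w, residual, jacobian, dt, beta=1.0):
    """One step of [1 + beta*dt*dR/dw] * dw = -dt * R(w) for a scalar unknown."""
    dw = -dt * residual(w) / (1.0 + beta * dt * jacobian(w))
    return w + dw

def iterate(w, residual, jacobian, dt, steps, beta=1.0):
    """Repeat the implicit step a fixed number of times."""
    for _ in range(steps):
        w = implicit_step(w, residual, jacobian, dt, beta)
    return w

def model_residual(w):
    """Illustrative nonlinear residual with steady state w = 2."""
    return w**3 - 8.0

def model_jacobian(w):
    """Derivative of the illustrative residual."""
    return 3.0 * w**2
```

Starting from w = 3, a large time step drives the solution to the steady state in a few iterations, whereas a small time step follows the transient and needs many more steps for the same residual level.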


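For a grid point (i,j,k) the term ∂R/∂W is expressed as

$$\begin{aligned}
\left[ \frac{\partial R}{\partial W} \right]_{i,j,k} ={}& \frac{\partial R^c_{i+1/2,j,k}}{\partial W} - \frac{\partial R^c_{i-1/2,j,k}}{\partial W} + \frac{\partial R^c_{i,j+1/2,k}}{\partial W} - \frac{\partial R^c_{i,j-1/2,k}}{\partial W} \\
&+ \frac{\partial R^c_{i,j,k+1/2}}{\partial W} - \frac{\partial R^c_{i,j,k-1/2}}{\partial W} - \left( \frac{\partial R^v_{i,j+1/2,k}}{\partial W} - \frac{\partial R^v_{i,j-1/2,k}}{\partial W} \right) .
\end{aligned}$$    (4.30)

Here it is assumed that the thin layer approximation of the
Navier-Stokes equations is used, in which only the viscous fluxes
in the wall-normal direction (j-direction) are taken
into account.

The evaluation of the quantities on the right hand side of
eq. (4.30) in terms of the flux Jacobians yields

$$\begin{aligned}
\left[ \frac{\partial R}{\partial W} \right]_{i,j,k} ={}& (A^+_{i,j,k} + A^-_{i+1,j,k}) - (A^+_{i-1,j,k} + A^-_{i,j,k}) \\
&+ (B^+_{i,j,k} + B^-_{i,j+1,k}) - (B^+_{i,j-1,k} + B^-_{i,j,k}) \\
&+ (C^+_{i,j,k} + C^-_{i,j,k+1}) - (C^+_{i,j,k-1} + C^-_{i,j,k}) \\
&- (-H_{i,j,k} + H_{i,j+1,k}) + (-H_{i,j-1,k} + H_{i,j,k}) ,
\end{aligned}$$    (4.31)

where A^±, B^±, C^± denote the split matrices of the flux Jacobians
A, B, C in i-, j-, k-direction, respectively. The matrices
with superscript '+' contain only positive and those with
superscript '-' only negative eigenvalues. As proposed in
[72] they are given by

$$A^{\pm} = \frac{1}{2} \left( A \pm \omega\, r_A I \right)$$    (4.32)

with

$$r_A = \max \{ |\lambda| : \lambda \text{ eigenvalue of matrix } A \} .$$    (4.33)

The factor ω, ω ≥ 1, determines the amount of implicit dissipation
and hence influences the damping and convergence
properties of the scheme.

The terms B^± and C^± are defined in the same manner. The
matrix H in eq. (4.31) corresponds to a viscous flux Jacobian
without the spatial operators. It has been found that in
the framework of a finite volume formulation the use of
correct metric terms is a critical point. For details the
reader is referred to reference [75].

Inserting expression (4.31) into equation (4.29) one obtains

$$\begin{aligned}
\Big\{ I + \beta \Delta t \big[ & (A^+_{i,j,k} + A^-_{i+1,j,k}) - (A^+_{i-1,j,k} + A^-_{i,j,k}) \\
&+ (B^+_{i,j,k} + B^-_{i,j+1,k}) - (B^+_{i,j-1,k} + B^-_{i,j,k}) \\
&+ (C^+_{i,j,k} + C^-_{i,j,k+1}) - (C^+_{i,j,k-1} + C^-_{i,j,k}) \\
&- (-H_{i,j,k} + H_{i,j+1,k}) + (-H_{i,j-1,k} + H_{i,j,k}) \big] \Big\} \Delta W = -\Delta t\, R^n_{i,j,k} .
\end{aligned}$$    (4.34)

The LU factorization of the implicit operator of eq. (4.34)
then yields

$$(L D^{-1} U)\, \Delta W = -\Delta t\, R^n_{i,j,k}$$    (4.35)

with the factors

$$\begin{aligned}
D &= I + \beta \Delta t \left[ A^+_{i,j,k} - A^-_{i,j,k} + B^+_{i,j,k} - B^-_{i,j,k} + C^+_{i,j,k} - C^-_{i,j,k} + 2 H_{i,j,k} \right] , \\
L &= D - \beta \Delta t \left[ A^+_{i-1,j,k} + B^+_{i,j-1,k} + C^+_{i,j,k-1} + H_{i,j-1,k} \right] , \\
U &= D + \beta \Delta t \left[ A^-_{i+1,j,k} + B^-_{i,j+1,k} + C^-_{i,j,k+1} - H_{i,j+1,k} \right] .
\end{aligned}$$    (4.36)

The use of splitting according to eq. (4.32) allows a simplified
evaluation of the diagonal operator D

$$D = I + \beta \Delta t \left[ \omega \left( r_A + r_B + r_C \right)_{i,j,k} I + 2 H_{i,j,k} \right] .$$    (4.37)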

The  diagonal  dominance  of  the  factors  L  and  U  is  provided 
by  eq.  (4.32).  Hence,  each  factor  of  the  decomposition  is 
diagonally  dominant  and  thus  the  numerical  stability  of  the 
inversion  process  is  ensured. 

As  demonstrated  in  [75]  one  iteration  of  the  LU-SSOR 
scheme  is  carried  out  in  two  steps,  a  forward  and  a  back¬ 
ward  sweep 

$$L\, \Delta W^{*} = -\Delta t\, R^n_{i,j,k} ,$$
$$U\, \Delta W = D\, \Delta W^{*} ,$$    (4.38)
$$W^{n+1} = W^n + \Delta W .$$

The  sweeps  are  accomplished  along  diagonal  lines.  As  a 
consequence, in comparison to most other implicit
schemes, only a block diagonal matrix for viscous flows -
or even a scalar diagonal for inviscid flows - instead of
block tri- or pentadiagonal matrices has to be inverted.
This reduces the numerical effort significantly and it also
allows a straightforward vectorization. Furthermore, as
shown  in  [72]  the  Jacobian  matrices  can  be  substituted  by 
fluxes  which  considerably  reduces  the  number  of  opera¬ 
tions.  All  in  all,  the  computational  expense  of  the 
LU-SSOR  scheme  is  comparable  to  that  of  an  explicit  two- 
stage  scheme  (see  chapter  4.2.1). 
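
The two sweeps of the LU-SSOR scheme can be sketched for a scalar model. The example below is a minimal illustration under stated assumptions, not the implementation of [75]: it treats 1D linear convection with a source term, a > 0, first-order upwind differencing on the right hand side, no viscous term (H = 0), and it absorbs the mesh spacing into the split Jacobians. All names are hypothetical.

```python
import numpy as np

def residual(w, a, dx, w_in, src):
    """First-order upwind residual of a*dw/dx = src for a > 0 (inflow value w_in)."""
    w_left = np.concatenate(([w_in], w[:-1]))
    return a * (w - w_left) / dx - src

def lu_ssor_step(w, a, dx, dt, w_in, src, beta=1.0, omega=1.0):
    """One LU-SSOR iteration: forward sweep, backward sweep, update."""
    r_a = abs(a)                              # spectral radius of the scalar Jacobian
    a_plus = 0.5 * (a + omega * r_a)          # split Jacobians, cf. eq. (4.32)
    a_minus = 0.5 * (a - omega * r_a)
    d = 1.0 + beta * dt * omega * r_a / dx    # scalar diagonal D
    rhs = -dt * residual(w, a, dx, w_in, src)
    n = w.size
    dw_star = np.zeros(n)
    for i in range(n):                        # forward sweep: L dW* = -dt R
        lower = dw_star[i - 1] if i > 0 else 0.0
        dw_star[i] = (rhs[i] + beta * dt * a_plus / dx * lower) / d
    dw = np.zeros(n)
    for i in range(n - 1, -1, -1):            # backward sweep: U dW = D dW*
        upper = dw[i + 1] if i < n - 1 else 0.0
        dw[i] = dw_star[i] - beta * dt * a_minus / dx * upper / d
    return w + dw
```

Only a division by the scalar diagonal is needed at each point. For ω = 1 and a > 0 the implicit operator coincides with the exact linearization of the upwind residual, so that large time steps lead to the steady state within a few iterations.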

In combination with multigrid the LU-SSOR scheme described
above is used as smoother on all grid levels. Its
damping properties have been investigated in detail by Blazek
[75] using a single-grid and a two-grid von Neumann Fourier
analysis as described in section 4.1. Central as well as
upwind spatial discretizations of the explicit operator (right
hand side of eq. (4.29)) have been considered. Figs. 37 and
38 present contours of the single-grid amplification
factor for the central discretization and the second-order
upwind discretization for a fully convection dominated
model problem. In both cases the influence of the relaxation
parameter ω (eq. (4.32)) is shown. For the central
discretization (Fig. 37) the best damping is obtained with ω = 1
(no  overrelaxation).  However,  it  is  evident,  that  compared 
to  the  very  good  damping  properties  of  explicit  multi-stage 
schemes  the  high  frequency  modes  are  only  poorly  damped 
with  the  implicit  LU-SSOR  scheme.  Fig.  38  shows  that  in 
the  case  of  upwind  discretization  the  high  frequency  damp¬ 
ing  behavior  of  the  LU-SSOR  scheme  can  be  significantly 
improved  by  a  moderately  increased  relaxation  parameter. 
Further  improvement  can  be  achieved  by  spending  two 
time  steps  on  the  fine  grid. 

Despite  the  fact  that  compared  to  the  explicit  multi-stage 
schemes  the  damping  behavior  of  the  implicit  LU-SSOR 
scheme  is  rather  poor,  an  efficient  and  robust  multigrid 
method  driven  by  the  LU-SSOR  scheme  can  be  con¬ 
structed  as  demonstrated  in  [75].  Fig.  39  displays  the  con¬ 
vergence  behavior  for  the  transonic  inviscid  flow  past  the 
NACA 0012 airfoil for M∞ = 0.8, α = 1.25°. An O-type grid
with  160x48  cells  has  been  used  and  the  right  hand  side  has 
been  discretized  by  central  differences.  The  convergence 
histories of a single-grid and two different 5-level multigrid
schemes are compared. The numbers in parentheses
denote the number of time steps on each grid, ordered from
the  finest  to  the  coarsest  one.  As  one  can  observe,  a  signifi¬ 
cant  improvement  of  the  convergence  is  obtained  by  using 
the  LU-SSOR  scheme  in  combination  with  multigrid. 
Note that the usual multigrid scheme with one time step on
each grid did not run stably. This may be due to the
poor  high  frequency  damping  of  the  LU-SSOR  scheme. 
Fig.  40  presents  the  convergence  behavior  of  the  implicit 
multigrid  scheme  for  a  hypersonic  laminar  flow  past  a  15° 
compression  ramp  at  a  medium  Reynolds  number  (test  case 
III.4 of the Antibes workshop, see Figs. 32-33). The flow parameters
are M∞ = 11.68, Re_L = 2.47×10^5, T∞ = 65 K and
T_w/T∞ = 4.604. A computational grid with 288x224 cells
has  been  used.  Fig.  40  displays  the  convergence  histories 
of  different  multigrid  strategies.  As  one  can  observe,  the 
multigrid  scheme  with  two  time  steps  on  all  grids  shows 
the  fastest  convergence  and  requires  by  far  the  shortest 
CPU-time  for  the  same  residual  level.  It  is  also  evident 
from Fig. 40 that the LU-SSOR scheme is adequate for solving
this flow problem only in combination with multigrid.

4.3  Multigrid  Strategies 

The  numerical  simulation  of  high  Reynolds  number  flows 
requires  coordinate  meshes  with  high-aspect  ratio  cells  in 
order  to  resolve  thin  shear  layers  with  a  reasonable  number 
of  grid  points.  This  renders  the  discretized  flow  equations 
stiff  because  the  spectral  radii  of  the  flux  Jacobian  in  wall- 
normal  and  tangential  coordinate  directions  are  very  differ¬ 
ent.  Consequently,  convergence  to  the  steady  state  slows 
down  considerably  for  such  flows  if  no  action  is  taken  to 
circumvent  the  problem.  Similarly,  stiffness  occurs  in  situ¬ 
ations  where  the  flow  is  aligned  with  the  grid  lines  and 
hence,  the  numerical  dissipation  inherent  in  modern 
upwind  schemes  vanishes.  One  possibility  to  cope  with 
stiffness  resulting  from  high-aspect  ratios  is  to  use  specific 
multigrid  strategies  in  order  to  improve  damping  rates.  The 
semicoarsening  method  introduced  by  Mulder  [58]  is  one 
possible  approach.  Fig.  41  gives  a  sketch  of  the  idea  for 
two  grid  levels.  With  conventional  full  coarsening  the  fine 
mesh  with  m  x  n  cells  is  coarsened  to  yield  a  mesh  with 
m/2 x n/2 cells. Figs. 41b-d show schemes with semicoarsening
in the different coordinate directions, which use two
coarse  meshes,  m/2  x  n,  and  m  x  n/2.  The  various  semi¬ 
coarsening  schemes  differ  in  how  the  corrections  on  the 
coarse  meshes  are  assembled  and  prolongated  according  to 
Ref. [76]. The coarse-mesh corrections of the scheme in
Fig. 41b are averaged before adding them to the fine mesh.
This is indicated by the numbers at the "up" arrows. Due to
this  averaging,  half  of  the  individual  corrections  on  the 
coarse  meshes  is  lost. 
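
The loss of half of an individual correction can be illustrated with a small sketch. It is not the scheme of Ref. [76]: it assumes cell-centered corrections stored as arrays and a piecewise-constant prolongation, and the names prolong_i, prolong_j and averaged_correction are hypothetical.

```python
import numpy as np


def prolong_i(corr):
    """Piecewise-constant prolongation from an (m/2, n) mesh to (m, n)."""
    return np.repeat(corr, 2, axis=0)


def prolong_j(corr):
    """Piecewise-constant prolongation from an (m, n/2) mesh to (m, n)."""
    return np.repeat(corr, 2, axis=1)


def averaged_correction(corr_i, corr_j):
    """Average the two prolongated coarse-mesh corrections (idea of Fig. 41b)."""
    return 0.5 * (prolong_i(corr_i) + prolong_j(corr_j))


m, n = 4, 6
corr_i = np.ones((m // 2, n))   # correction found only on the m/2 x n mesh
corr_j = np.zeros((m, n // 2))  # no correction from the m x n/2 mesh
fine_update = averaged_correction(corr_i, corr_j)
print(fine_update.shape, fine_update.max())
```

With a unit correction on only one of the two coarse meshes, the fine mesh receives 0.5 everywhere, which is the deficiency the averaging introduces.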

In  order  to  overcome  this  deficiency  of  the  semicoarsening 
scheme,  two  more  variants  are  considered.  For  the  scheme 
of  Fig.  41c,  the  solutions  on  the  coarse  meshes  are  com¬ 
puted  sequentially.  Hence,  the  corrections  obtained  on  the 
m/2 x n mesh can be used to update the m x n/2 mesh before
time stepping (as indicated by the horizontal arrow).
The  sequential  update  of  the  second  coarse  mesh  allows  the 
full  amount  of  corrections  to  be  passed  up  to  the  fine  mesh. 
An  interesting  compromise  between  the  schemes  of 
Figs.  41b  and  41c  is  displayed  in  Fig.  41d.  Here,  only  the 
corrections  common  to  both  of  the  coarse  meshes,  m/2  x  n, 
and  m  x  n/2,  are  averaged,  whereas  the  corrections  to  the 
modes  living  either  on  m/2  x  n  or  on  m  x  n/2  are  passed  to 
the  fine  mesh  in  full. 

For  the  numerical  solution  of  the  Navier-Stokes  equations, 
the two-level strategies presented in Fig. 41 are extended to
multilevel schemes as displayed in Fig. 42 following ideas
of Mulder [58]. Suitable coordinate meshes for thin boundary
layers exhibit mostly cells with high aspect ratios in the
surface-aligned  direction.  Fig.  43  displays  further  variants 
of  semicoarsening  for  these  situations  which  are  computa¬ 
tionally  cheaper  than  the  schemes  shown  in  Fig.  42. 
Detailed  numerical  investigations  for  various  viscous  flow 
problems have been reported in Ref. [76]. A sample result
is  presented  in  Figs.  44-45.  The  flow  over  a  slender  fore¬ 
body  is  chosen  to  represent  a  generic  configuration  corre¬ 
sponding  to  an  air-breathing  high-speed  transport.  The  high 
Reynolds  number  requires  a  mesh  with  aspect  ratios  up  to 
25000. The flow computations were done with boundary
layer  transition  fixed  at  2  percent  chord.  The  flow  solution 
shown  in  Fig.  44  was  extensively  investigated  with  respect 
to  both  grid  convergence  and  residual  convergence.  The 
convergence histories presented in Fig. 45 indicate substantial
convergence acceleration by multigrid. The sequential
semicoarsening scheme takes 194 cycles and 570 s
on a CRAY Y-MP to reduce the averaged residuals by 6 orders
of magnitude. The scheme with full coarsening takes
1024 cycles and 1230 s, whereas the single-mesh code requires
7762 steps and 6190 s to achieve the same level. We conclude
that suitable multigrid strategies can improve computational
efficiency by an order of magnitude for tough flow
problems.
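
The speedups implied by these figures can be checked with a few lines of arithmetic. The sketch below only uses the cycle counts and CPU times quoted above; the dictionary layout and function names are illustrative.

```python
# Values quoted in the text (CRAY Y-MP, residuals reduced by 6 orders).
runs = {
    "sequential semicoarsening": {"cycles": 194, "cpu_s": 570.0},
    "full coarsening": {"cycles": 1024, "cpu_s": 1230.0},
    "single mesh": {"cycles": 7762, "cpu_s": 6190.0},
}


def speedup(name, baseline="single mesh"):
    """CPU-time ratio of the baseline run to the named run."""
    return runs[baseline]["cpu_s"] / runs[name]["cpu_s"]


def seconds_per_cycle(name):
    """Average CPU cost of one cycle (or time step) of the named run."""
    return runs[name]["cpu_s"] / runs[name]["cycles"]


for name in runs:
    print(f"{name}: speedup {speedup(name):.1f}, "
          f"{seconds_per_cycle(name):.2f} s per cycle")
```

The sequential semicoarsening run is roughly 11 times faster than the single-mesh run in CPU time, although each of its cycles costs several times more than a single-mesh time step.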

5.  APPLICATIONS 

Applications to complex, three-dimensional configurations
are given in this section. To demonstrate the range of
applicability,  sub-,  trans-,  and  hypersonic  flow  fields  are 
considered.  Since  the  solution  method  uses  structured, 
body-fitted  meshes,  the  multiblock  approach  is  employed. 
Hence,  this  section  begins  with  an  outline  of  the  multiblock 
concept.  The  first  problem  to  be  presented  is  concerned 
with  the  interaction  of  a  jet  with  a  multi-element  wing  at 
subsonic speed. The next application deals with engine-airframe
integration for transonic transport aircraft. Finally, the
flow field around a reentry vehicle at hypersonic speeds
is analyzed.

5.1  Multiblock  Approach 

Using structured, body-fitted meshes, the physical domain
is decomposed into a set of computational cells by the
curvilinear coordinates ξ, η, and ζ, as sketched in Fig. 46 for
a three-dimensional wing. The curvilinear coordinates allow
the mapping of the physical domain into a computational
domain as shown in Fig. 47, where the computational
coordinates i, j, k are defined along the curvilinear
directions ξ, η, and ζ. With the indices i, j, k each point in
the computational domain and its neighbors may be identified
directly, and the underlying structure allows an easy
implementation of the solution algorithm on vector computers.

For  complex  configurations  in  general  it  is  not  possible  to 
map  the  physical  domain  into  one  coherent  computational 
domain.  Therefore,  the  physical  domain  of  interest  is  de¬ 
composed  into  different  appropriate  regions  which  are 
called  blocks.  Each  block  is  mapped  into  a  separate  com¬ 
putational  domain,  and  the  flow  solver  is  then  repeatedly 
applied  to  the  different  blocks.  In  order  to  establish  a  com¬ 
munication  between  the  blocks,  data  has  to  be  transferred 
between  adjacent  block  faces.  In  the  DLR  CEVCATS  code 
considered  here,  the  exchange  of  data  is  established  by  us¬ 
ing  the  concept  of  fictitious  points,  as  sketched  for  a  two- 
dimensional  example  in  Fig.  48.  For  a  2D  problem  the  real 
computational  domain  ranges  from  i  =  2  to  i  =  imax  and  j  = 
2  to  j  =  jmax.  This  real  computational  domain  is  sur¬ 
rounded  by  a  sheet  of  fictitious  cells,  as  indicated  by  the 
dashed lines in Fig. 48. Considering a simple O-mesh
around  an  airfoil,  the  physical  domain  may  be  mapped  into 
the  computational  domain  by  introducing  a  computational 
cut  at  i  =  2  and  i  =  imax,  see  Fig.  49.  Since  in  the  physical 
domain  the  block  faces  at  i  =  2  and  i  =  imax  are  adjacent  to 
each other, the exchange of data can be performed by loading
the data of line i = 3 into the line of fictitious points at
i = imax+1, and by loading data of i = imax-1 into the line
with i = 1, as sketched in Fig. 50. It should be noted that
when  using  a  vertex  based  method,  it  is  not  sufficient  to 
transfer  only  the  dependent  variables.  In  order  to  evaluate 
the  flux  balances  for  the  points  lying  directly  on  the  line  of 
the computational cut, the Cartesian coordinates of the adjacent
block face also have to be provided for the fictitious
points.  However,  for  time-invariant  grids  this  needs  only  to 
be  done  once  at  the  beginning  of  the  computation. 
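
The exchange at the computational cut of an O-mesh can be sketched as follows. This is a minimal illustration under assumptions, not the CEVCATS implementation: the array q holds one dependent variable with the first index running from 0 to imax+1 (index 0 is unused so that the indices match the text), and the function name update_o_mesh_cut is hypothetical.

```python
import numpy as np


def update_o_mesh_cut(q, imax):
    """Fill the fictitious lines i = 1 and i = imax+1 across the cut.

    Lines i = 2 and i = imax coincide in physical space, so the neighbor
    beyond i = imax is line i = 3, and the neighbor before i = 2 is
    line i = imax-1.
    """
    q[imax + 1, ...] = q[3, ...]
    q[1, ...] = q[imax - 1, ...]
    return q


imax, jmax = 6, 4
q = np.zeros((imax + 2, jmax + 2))
for i in range(2, imax + 1):
    q[i, :] = float(i)  # label each real line by its index for illustration
update_o_mesh_cut(q, imax)
print(q[1, 0], q[imax + 1, 0])
```

For a vertex-based method the same copy would also be applied once to the coordinate arrays, as noted above.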

It  is  well  known  that  there  does  not  exist  one  optimal  grid 
topology  for  arbitrary  configurations.  Each  aerodynamic 
component  of  an  aircraft  may  have  its  own  natural  grid 
structure,  and  different  configurations  call  for  different 
block  arrangements.  Therefore,  the  part  of  a  computer  code 
which  depends  on  the  specific  configuration  has  to  be  kept 
to  a  minimum  to  allow  an  easy  change  of  grid  topologies. 
In  the  CEVCATS  code  this  flexibility  is  provided  by  an  ex¬ 
ternal  logic-file  which  contains  all  information  about  the 
arrangement  of  blocks,  adjacent  block  faces,  and  boundary 
conditions  on  these  block  faces.  With  the  information 
stored  in  the  logic-file,  data  in  the  fictitious  cells  is  updated 
depending  on  the  boundary  conditions  specified  on  the  par¬ 
ticular  block  face.  In  order  to  allow  a  high  flexibility  for 
complex  problems,  block  faces  may  be  subdivided  into  ar¬ 
bitrary  segments.  The  logic-file  then  identifies  the  size  and 
the  type  of  boundary  condition  on  the  segments.  The  use  of 
the logic-file allows one source code to be applied to various
kinds of problems without the need to change and recompile
the program.
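
The idea of keeping the topology outside the source code can be sketched with a small data structure. The actual format of the CEVCATS logic-file is not given here, so everything below is hypothetical: the line layout, the field names, the boundary condition keywords and the function parse_logic_line are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaceSegment:
    """One segment of a block face with its boundary condition."""
    block: int
    face: str                      # e.g. "imin", "imax", "jmin", "jmax"
    start: int                     # first index of the segment on the face
    end: int                       # last index of the segment on the face
    bc: str                        # e.g. "wall", "farfield", "cut"
    neighbor_block: Optional[int] = None
    neighbor_face: Optional[str] = None


def parse_logic_line(line):
    """Parse one line of the assumed format: block face start end bc [nb nf]."""
    f = line.split()
    seg = (int(f[0]), f[1], int(f[2]), int(f[3]), f[4])
    if f[4] == "cut":
        return FaceSegment(*seg, int(f[5]), f[6])
    return FaceSegment(*seg)


# Assumed description of a single-block O-mesh around an airfoil.
LOGIC_TEXT = """
1 jmin 2 65 wall
1 jmax 2 65 farfield
1 imin 2 33 cut 1 imax
1 imax 2 33 cut 1 imin
"""

segments = [parse_logic_line(l) for l in LOGIC_TEXT.strip().splitlines()]
cuts = [s for s in segments if s.bc == "cut"]
print(len(segments), len(cuts))
```

Changing the grid topology then means editing the text description only; the solver loops over the segments and applies the boundary condition named on each one.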

As long as the data of all blocks is stored in the main memory
of the computer, the fictitious cells at computational
cuts may be updated after each operation inside a block.
By sweeping successively through all blocks and consistently
updating the block boundaries at coordinate cuts, the
block structure can be made invisible to the solution algorithm
for explicit time stepping methods. However, in order
to enable the computation of problems which exceed
the storage capacity of the main memory, the DLR CEVCATS
code allows the storage of block data on external
high-speed storage devices. In this case only one block at a
time  is  loaded  into  the  main  memory,  and  data  of  all  other 
blocks  is  stored  on  the  external  devices.  Having  performed 
a  certain  number  of  operations  inside  the  block,  data  is  un¬ 
loaded  onto  the  external  devices  and  data  of  the  next  block 
is  transferred  into  the  main  memory.  This  strategy  theoreti¬ 
cally  enables  the  computation  of  problems  with  an  almost 
unrestricted number of grid points. The problem with this
strategy is the large number of I/O operations which arise
when the cut boundaries are to be updated consistently.
This  becomes  especially  important  when  multigrid  acceler¬ 
ation  is  used,  since  performing  more  operations  inside  a 
block  before  switching  to  the  next  one  introduces  a  time- 
lag  in  the  evolution  of  the  solution  in  different  blocks.  This 
time-lag  may  severely  deteriorate  the  damping  properties 
of  the  scheme,  which  are  mandatory  for  good  multigrid 
performance.  In  the  CEVCATS  code  different  strategies  for 
multiblock  multigrid  have  been  implemented  to  allow  the 
best  compromise  between  convergence  and  I/O  operations, 
depending  on  the  problem.  Without  going  into  the  details 
of  the  different  strategies  it  may  be  noted  that  even  in  the 
strategy with the smallest number of I/O operations, a sweep
through  all  blocks  is  completed  on  one  grid  level  before 
starting  on  the  next  coarser  grid.  It  was  found  that  perform¬ 
ing  a  complete  multigrid  cycle  inside  a  block  before 
switching  to  the  next  block  degrades  the  multigrid  perfor¬ 
mance  to  that  of  a  code  without  multigrid  acceleration  or 
even  inhibits  convergence.  The  application  of  Full  Multi¬ 
grid  may  alleviate  the  problems  associated  with  the  time- 
lag,  since  the  solution  which  evolved  on  coarser  meshes 
provides  a  well  conditioned  starting  solution  on  the  finest 
mesh,  and  time-differences  between  blocks  are  then  al¬ 
ready  rather  small. 
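
The trade-off between I/O and convergence can be made concrete with a rough count of block loads per multigrid cycle. The model below is an assumption made for illustration, not a description of the CEVCATS strategies: it assumes a V-cycle in which every level is visited on the way down and on the way up, and exactly one load of each block per level visit.

```python
def level_visits_v_cycle(n_levels):
    """Level visits in one V-cycle: down and up, coarsest level once."""
    return 2 * n_levels - 1


def block_loads_per_cycle(n_blocks, n_levels, strategy):
    """Number of block loads from external storage in one multigrid cycle."""
    if strategy == "sweep_per_level":
        # All blocks are swept on one grid level before changing level.
        return n_blocks * level_visits_v_cycle(n_levels)
    if strategy == "cycle_per_block":
        # A complete cycle is performed inside each block (least I/O,
        # but reported above to degrade or inhibit convergence).
        return n_blocks
    raise ValueError(f"unknown strategy: {strategy}")


for strategy in ("sweep_per_level", "cycle_per_block"):
    print(strategy, block_loads_per_cycle(14, 4, strategy))
```

Under these assumptions, 14 blocks and 4 grid levels give 98 block loads per cycle for the level-wise sweep against 14 for a complete cycle inside each block, which shows why the I/O cost is a real constraint on the choice of strategy.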

Details  of  the  implementation  of  the  multiblock  multigrid 
technique  into  the  CEVCATS  code  may  be  found  in  [77] 
and  [78]. 

5.2  Interaction  of  a  Jet  with  a  Multi-Element  Wing 

The  influence  of  a  jet  on  a  High-Lift  device  was  investi¬ 
gated.  It  was  assumed  that  the  flow  field  will  be  dominated 
by  the  momentum  of  the  jet  flow,  and  the  solution  of  the 
Euler equations was regarded as being sufficient to describe
the main flow phenomena. The greatest challenge
was to decide on an appropriate grid topology. On the one
hand  the  components  of  the  High-Lift  device  had  to  be  suf¬ 
ficiently  resolved,  and  on  the  other  hand  the  jet  generator 
had  to  be  incorporated  into  the  mesh.  Therefore,  a  prelimi¬ 
nary  two-dimensional  study  was  performed  to  investigate 
different  grid  topologies  for  multi-element  airfoils.  In  the 
finally chosen topology all single components were resolved
by local O-meshes around each component, and the
O-meshes were then embedded into a global H-mesh. The
grid was generated with the mesh generation tool MEGACADS
[79]. Fig. 51 gives a view of the 2D mesh around
the  complete  multi-element  airfoil,  and  Fig.  52  and  Fig.  53 
show  the  mesh  topology  in  the  region  of  the  slat  and  in  the 
region  of  flap  and  tab. 

Since  the  CEVCATS  code  has  an  option  for  the  computa¬ 
tion  of  two-dimensional  flows  on  block  structured  grids, 
the  same  source  code  as  for  the  following  three-dimen¬ 
sional  computations  could  be  used  for  this  preliminary  test 
problem.  Fig.  54  shows  the  pressure  distribution  computed 
for M∞ = 0.182 and α = 10°. The corresponding distribution
of total pressure losses is displayed in Fig. 55. On all
components  total  pressure  losses  are  well  below  2%.  The 
convergence  history  for  this  case  is  given  in  Fig.  56,  where 
a W-cycle with four grid levels was used.

The  described  grid  topology  had  proved  to  be  adequate  for 
this  problem,  and  the  incorporation  of  the  jet-generator  was 
achieved  as  sketched  in  Fig.  57.  Fig.  58  shows  a  view  of 
the  symmetry  plane  of  the  final  grid,  where  the  components 
of the multi-element wing and the jet-generator are displayed
as solid objects. The jet-generator had been resolved
by a local polar mesh, and this polar mesh was embedded
into the global mesh, as shown in Fig. 59.

First computations were performed at M∞ = 0.182 and α =
10°, and the ratio of the total pressure of the jet to the ambient
static pressure was chosen as p_t,jet/p∞ = 2.0. At these
conditions the Mach number of the jet is close to M_jet = 0.9
at the exit of the jet generator. Fig. 60 shows the Mach
number  distribution  in  the  symmetry  plane.  The  jet  can  be 
identified  by  the  concentration  of  isolines  at  the  jet  bound¬ 
aries.  Due  to  the  numerical  viscosity,  the  boundaries  are 
spread  into  regions  of  large  gradients  instead  of  being  dis¬ 
continuities.  Since  for  these  calculations  the  basic  cell  ver¬ 
tex  central  differencing  scheme  had  been  used,  the  smear¬ 
ing  effect  of  the  scalar  dissipation  is  clearly  visible.  The  jet 
passes very closely beneath the slat, and due to the presence
of flap and tab the jet is deflected by nearly 25°. In
Fig.  61  the  corresponding  streamline  pattern  is  displayed. 
When  the  jet  hits  flap  and  tab,  streamlines  are  running 
against  the  main  flow  direction  around  the  leading  edges  of 
flap  and  tab.  Fig.  62  gives  an  enlargement  of  the  region 
around  the  tab.  Since  the  streamlines  are  following  the  sur¬ 
faces  of  flap  and  tab,  the  momentum  of  this  deflected  part 
of the flow leads to a deflection of the total jet. The interaction
between jet and flap/tab influences a large region of
the flow around the wing. Fig. 63 shows the streamline pattern
on the lower wing surface. Hitting flap and tab, fluid
diverts in all directions. The main flow direction prevails
again only about five engine diameters away from the
symmetry plane. It should be noted that the boundary
at  the  wing  tip  was  modelled  by  solid  wall  conditions  to 
simulate  the  wind  tunnel  walls. 

For the onflow conditions of M∞ = 0.147, α = 10°, and a
pressure ratio of p_t,jet/p∞ = 1.252, experimental data were
available. Sectionwise pressure distributions were measured
in the symmetry plane, half an engine diameter away
from the symmetry plane, and one engine diameter away
from the symmetry plane. Figs. 64-66 show a comparison
of experimental and computational data. The qualitative
agreement between calculation and experiment is quite
good, despite the neglect of viscous effects. The influence
of the jet on the pressure distribution at different
spanwise positions is accurately predicted by the calculation.

5.3 Engine Integration for Transport Aircraft

Engine/airframe  integration  is  a  key  feature  in  the  design 
and  development  of  advanced  technology  aircraft,  since 
the  interaction  between  propulsion  system  and  airframe 
can  have  a  significant  impact  on  the  performance  of  the  air¬ 
craft.  It  is  evident  that  an  optimal  integration  of  the  propul¬ 
sion  system  into  the  airframe  will  result  in  an  enhanced 
performance  of  the  whole  aircraft.  In  order  to  get  a  better 
understanding  of  the  aerodynamic  phenomena  playing  the 
major  roles  in  the  interference  process,  substantial  efforts 
have  been  made  to  simulate  interference  effects.  Besides 
wind tunnel testing, numerical methods are increasingly
gaining  attention,  and  the  solution  of  the  Euler  equations 
has  successfully  been  used  to  predict  interference  effects 
[80,  81,  82].  However,  the  flow  around  modern  transonic 
wings  is  very  sensitive  to  viscous  effects,  and  neglecting 
viscosity  leads  to  systematic  deviations  from  experimental 
results  [83].  Therefore,  the  Navier-Stokes  equations  have 
to  be  solved  for  an  adequate  simulation.  For  complex  con¬ 
figurations  grid  generation  becomes  a  substantial  chal¬ 
lenge,  especially  for  viscous  flows,  since  the  boundary  lay¬ 
ers  on  all  components  have  to  be  resolved.  To  alleviate  the 
necessary  effort  and  to  approach  the  task  of  generating  a 
viscous  grid  for  the  complete  configuration  step  by  step,  it 
therefore  seems  appropriate  to  first  resolve  only  the  bound¬ 
ary  layer  on  the  wing  and  to  treat  all  other  components  as 
in  inviscid  flow. 

In  the  study  to  be  presented  here,  the  DLR-F6  configura¬ 
tion  has  been  selected  as  a  generic  twin-engine  transport 
aircraft  configuration.  The  propulsion  system  is  simulated 
by  axisymmetric  throughflow  nacelles,  and  the  nacelle  po¬ 
sition  was  chosen  to  give  rise  to  quite  strong  interference 
effects. Fig. 67 presents a view of the model in tail-off
configuration including the main geometrical dimensions.
Using  block-structured  methods,  an  appropriate  grid  topol¬ 
ogy  has  to  be  chosen.  On  the  one  hand  different  engine  sys¬ 
tems  may  have  to  be  realized,  and  on  the  other  hand  the 
boundary layer around the wing has to be resolved adequately.
Here a global H-topology in streamwise direction
and an O-topology in spanwise direction have been chosen.
Nacelle and pylon have been embedded into this grid by
using a local polar subgrid with an H-type topology in
streamwise direction. Fig. 68 shows selected grid planes to
visualize the spanwise topology of the wing and the nacelle.
To resolve the wing boundary layer, a C-grid wrapped
around the wing has been integrated into the global H-O
topology. Fig. 69 presents the resulting H-C-O topology. The
C-block  is  generated  using  the  surface  normal  vectors  of 
the  wing,  and  the  first  distance  off  the  wall  is  about  1.0  x 
10'^,  Fig.  70  shows  a  grid  plane  at  the  pylon  location 
through  the  nacelle  to  display  the  embedded  C-grid.  It 
should  be  noted  that  in  the  figures  not  all  grid  lines  have 
been  displayed  to  allow  a  clear  presentation.  The  complete 
field grid consisted of about 1,200,000 cells, and 14 computational
blocks were used. The number of blocks
was dictated not only by topological requirements; the
maximum block size also had to be adapted to the limited main
memory of the computer.

Experiments  for  the  DLR-F6  model  have  been  carried  out 
in the S2MA wind tunnel of ONERA [83]. Pressure distributions
have been measured at eight different wing sections
with two of them located closely inboard and outboard
of the pylon. Transition was fixed and the Reynolds
number was kept constant at Re = 3.0 x 10^6. The results to
be presented here are restricted to typical cruise conditions
of a transonic transport aircraft at M∞ = 0.75 and α = 0.98°.
For  the  computations  the  flow  was  assumed  to  be  fully  tur¬ 
bulent  and  the  algebraic  turbulence  model  of  Baldwin  and 
Lomax  [84]  was  used  in  the  solution  of  the  Reynolds-aver¬ 
aged  Navier-Stokes  equations. 

Fig.  71  shows  a  comparison  of  measured  and  computed 
pressure  distributions  at  two  sections  inboard  of  the  pylon, 
and  Fig.  72  gives  the  comparison  for  two  sections  located 
outboard.  The  exact  location  of  the  sections  is  given  in  the 
sketch  in  the  figures.  The  shock  location  predicted  by  the 
computation  agrees  favorably  with  the  experimental  data. 
The  interference  effects  caused  by  the  nacelle  are  clearly 
visible  by  the  difference  in  the  pressure  distributions  just 
inboard (y/s = 0.331) and outboard (y/s = 0.377) of the pylon.
At y/s = 0.331 on the lower wing surface a strong flow
acceleration  occurs.  The  computation  accurately  predicts 
the corresponding pressure peak. Outboard at y/s = 0.377,
however, the flow is only accelerated around the pylon
leading  edge,  but  then  no  further  acceleration  occurs.  This 
difference  between  inboard  and  outboard  side  of  the  pylon 
is  simulated  in  agreement  with  the  experiment,  indicating 
that in this case interference is mainly caused by the displacement
effect of pylon and nacelle. Besides this overall
agreement,  there  are  still  discrepancies  in  the  simulation. 
Downstream  of  the  shock  an  overexpansion  occurs  in  the 
computation,  which  is  not  observed  in  the  experiment.  Fur¬ 
thermore,  the  effect  of  the  rear  loading  is  overpredicted  by 
the  computation.  The  reason  for  these  effects  is  still  not 
clear.  On  the  one  hand  the  wing  had  a  blunt  trailing  edge 
which  was  artificially  closed  for  the  computation.  On  the 
other  hand  the  grid  distortion  in  the  vicinity  of  the  pylon 
may  be  too  large  and  lead  to  a  reduction  of  solution  accu¬ 
racy.  Computations  of  the  configuration  without  nacelle 
gave  better  agreement  with  experimental  data  [84].  The 
overexpansion  downstream  of  the  shock  and  the  overpre¬ 
diction  of  the  rear  loading  lead  eventually  to  an  overpredic¬ 
tion  of  the  spanwise  lift  distribution.  Fig.  73  shows  a  com¬ 
parison  of  measured  and  calculated  spanwise  lift 
distributions. Apart from the overprediction of lift, the characteristic
discontinuity at the pylon location is accurately predicted
by the computation. The computations have been
carried out on the CRAY Y-MP computer of DLR, and
Fig.  74  presents  the  convergence  history  for  this  case.  Full 
Multigrid  has  been  used  with  4  grid  levels  on  the  first 
mesh.  The  residual  could  be  reduced  by  3  orders  of  magni¬ 
tude  within  150  iterations.  The  computation  required  about 
6000  seconds  of  CPU-time  on  the  CRAY  Y-MP. 

5.4 Aerothermodynamics of Winged Reentry Vehicles

At  hypersonic  flow  conditions  the  thermal  stability  of  the 
materials  used  for  the  fabrication  of  the  flight  vehicle  limits 
the  maximum  allowed  heating  of  the  surfaces.  The  heating 
becomes  critical  during  reentry  maneuvers  at  high  Mach 
numbers  where  peak  heating  rates  occur  at  the  nose  of  the 
vehicle,  along  the  leading  edges  of  wing  and  winglet,  and 
on  deflected  control  surfaces  which  are  necessary  to 
achieve  equilibrium  in  pitching  moment. 

Fig.  75  shows  the  European  space  plane  HERMES  which 
is a typical design for personnel transport to orbit and return
missions. The critical heat loads on the HERMES configuration
during reentry have been analyzed using a series of
global  and  local  flow  solutions  which  were  computed  with 
the  DLR  multiblock  code  CEVCATS.  The  hypersonic  flow 
computations  require  high  resolution  of  very  strong  shocks 
and  thin  temperature  layers  near  the  surfaces.  Therefore, 
the  hybrid  AUSM  scheme  described  in  section  3  was  em¬ 
ployed  for  spatial  discretization  instead  of  the  central-dif¬ 
ference  scheme  used  for  the  transonic  flow  cases. 

Elevon heating and pitching moment coefficients of HERMES
were computed with global flow solutions on a grid
with 800,000 points, shown in Fig. 76, and an additional series
of local flow solutions, Fig. 77, with deflected elevons
[85]. As the inviscid part of the flow is supersonic in axial
direction, the flow variables in the inflow plane of the local
computational domain for the rear of HERMES could be
obtained from the global flow solutions. Steady-state solutions
were obtained with about 300 multigrid cycles.
Fig.  78  displays  streamlines  and  Stanton  numbers  for  the 
rear  of  the  windward  side  of  HERMES  (1.0)  configuration 
and 10° deflection of elevon and body flap. The flow conditions
correspond to wind tunnel tests in ONERA S4MA. A
large  separation  occurs  at  the  hinge  line  of  the  deflected 
controls.  The  computed  Stanton  numbers  are  in  good 
agreement  with  the  wind  tunnel  data.  Some  discrepancies 
occur along the symmetry line which were traced to boundary
layer transition in the experiment. Note that the experimental
M∞ = 10 data represent the highest Mach number
for which reliable experimental data for the complete configuration
can be obtained in Western Europe. However,
flight peak heating rates occur at M∞ = 25 and a trajectory
point of 75 km altitude. At these flow conditions, the Reynolds
number is lower than at M∞ = 10 and significant
chemical  reactions  take  place  in  the  flow  due  to  high  tem¬ 
peratures.  These  reactions  were  taken  into  account  by  as¬ 
suming  air  in  thermochemical  equilibrium  in  our  computa¬ 
tions.  Fig.  79  displays  significant  differences  in  the  flow 
behavior between both flow conditions. The flow separation
almost disappears at M∞ = 25. However, the heat flux is
more sensitive to local flow divergence than at M∞ = 10,
that is, heating increases strongly towards the lateral edges
of  the  deflected  elevon.  Ref.  [85]  presents  a  detailed  analy¬ 
sis  of  the  flap  heating  versus  flap  efficiency  and  also  the  ef¬ 
fect  of  changing  flap  geometries. 

The  need  for  aerodynamic  control  makes  the  integration  of 
control  surfaces  for  pitch,  roll,  and  yaw  control  necessary. 
This  is  accomplished  by  defining  the  body  flap,  the  elevon 
and the rudder, according to Fig. 75. The controls are sized
by the requirement of sufficient control surface efficiency
at hypersonic speed and the maximum deflection angle allowed
to limit aerodynamic heating. A large slot between
the rudder and the elevon is thus unavoidable due to the dihedral
of the winglet. The winglet of a winged reentry vehicle
with aerodynamic control therefore has two edges
which are exposed to the incoming flow. These are:

- the leading edge of the winglet with a local maximum
of the heat flux which depends on the angle of attack,
the geometric angle ψ according to Fig. 75, and the
leading edge radius

- the lower edge of the rudder where an attachment line
with a local maximum of the heat flux is expected.

Computations  of  peak  heating  rates  along  the  attachment 
lines  at  the  winglet  leading  edge  and  the  lower  edge  of  the 
rudder  are  reported  in  Ref.  [86].  Here,  we  will  only  present 
some surprising results which could not have been obtained
without extensive use of 3D flow computations. As for the
computations of flap heating we have used a series of global
and local flow solutions to compute attachment line
heating. The local flow solutions were necessary to represent
the complex geometry of the slot in between elevon
and rudder with a two-block computational domain, according
to Fig. 80.

Peak heating rates along the leading edges of the wing and
winglet of HERMES are presented in Fig. 81. It is seen that
there exists a large sensitivity of winglet heating to angle
of attack. At higher angles of attack, the effective
sweep of the leading edge increases, thereby reducing heat
load. Even though peak heat fluxes may be measured in
wind tunnel tests at M∞ = 10, numerical flow simulations
are necessary for trajectory points at higher Mach numbers.
Fig. 81 demonstrates that semiempirical correlations intended
to collapse peak heating at different flow conditions for
simple shapes, i.e. the use of Stanton-Miller numbers of
Ref. [87], do not necessarily work well at the winglet. The
differences in Stanton-Miller numbers between wind tunnel
and flight condition may be due to increased viscous interaction
at the lower Reynolds number, and also to the high
temperature chemical effects on local flow angles ahead of
the winglet.

A completely different trend is observed for heating along the lower
edge of the rudder. Fig. 82 shows that nondimensional heat fluxes are
reduced by 35% for flight conditions as compared to wind tunnel
conditions. Inspection of the computed flow fields shows two flow
phenomena which may be responsible for this behavior. Firstly, we
observe large flow separations at the lateral edges of the elevon for
the wind tunnel conditions which seem to form a modified effective
slot shape with more rapid flow expansion, see Fig. 83. Secondly, the
thicker boundary layers present in the flow solution for flight
conditions, Fig. 84, tend to block the slot and hence reduce flow
expansion and peak heating rates.

In conclusion, we have successfully used 3D flow computations in the
aerothermal analysis of winged reentry vehicles. These computations
allow detailed understanding of critical flow phenomena and a much
more accurate transposition from wind tunnel to flight as compared to
strategies used for the US-Orbiter twenty years ago. Consequently, the
uncertainties of the data used to design the thermal protection system
are considerably reduced, which in turn helps to reduce the weight of
space planes.

6.  CONCLUSION 

Well established algorithms used in current blockstructured
Euler/Navier-Stokes  solvers  for  industrial  applications 
have  been  reviewed.  Attention  has  been  focused  on  various 
spatial  discretization  and  time  stepping  schemes.  The  ap¬ 
proach  of  blockstructured  meshes  has  been  discussed  in  de¬ 
tail.  It  allows  the  treatment  of  complex  configurations  and 
forms  the  basis  of  parallelization  of  structured  solvers. 


Special  emphasis  has  been  put  on  the  implementation  of 
multigrid  within  a  blockstructured  solver.  Several  large- 
scale  computations  have  been  shown  which  demonstrate 
the  ability  of  current  blockstructured  flow  solvers  for  3D 
complex  applications. 
Fig. 3 Cell-vertex scheme

Fig. 5 Influence of grid density on distributions of pressure and
skin friction along RAE 2822 airfoil (M∞ = 0.73, α = 2.79°,
Re∞ = 6.5×10^6), central scheme with scalar dissipation

Fig. 7 Pressure distributions along RAE 2822 airfoil (M∞ = 0.73,
α = 2.79°, Re∞ = 6.5×10^6), comparison of central schemes with scalar
and matrix-valued dissipation

Fig. 8 Skin friction distributions along RAE 2822 airfoil (M∞ = 0.73,
α = 2.79°, Re∞ = 6.5×10^6), comparison of central schemes with scalar
and matrix-valued dissipation

Fig. 9 Influence of grid density on global force coefficients for
flow around RAE 2822 airfoil (M∞ = 0.73, α = 2.79°, Re∞ = 6.5×10^6),
comparison of scalar and matrix-valued dissipation

Influence of grid density on distributions of pressure and skin
friction along RAE 2822 airfoil (M∞ = 0.73, α = 2.79°,
Re∞ = 6.5×10^6), upwind TVD scheme

Fig. 12 Influence of grid density on global force coefficients for
flow around RAE 2822 airfoil (M∞ = 0.73, α = 2.79°, Re∞ = 6.5×10^6),
upwind TVD scheme

Fig. 18 Mach contours (ΔM = 0.5) for 15° compression ramp
(M∞ = 11.68, Re = 2.47×10^5), upwind TVD scheme

Fig. 19 Influence of grid density on pressure coefficient for 15°
compression ramp (M∞ = 11.68, Re = 2.47×10^5), upwind TVD scheme

Fig. 20 Influence of grid density on skin friction coefficient for
15° compression ramp (M∞ = 11.68, Re = 2.47×10^5), upwind TVD scheme

Fig. 21 Influence of grid density on Stanton number for 15°
compression ramp (M∞ = 11.68, Re = 2.47×10^5), upwind TVD scheme

Fig. 22 Schematic of Type IV shock-shock interaction of Edney

Fig. 33 Pressure coefficient, skin friction and Stanton number for
15° compression ramp (M∞ = 11.68, Re = 2.47×10^5), comparison of
upwind TVD and improved AUSM

Fig. 34 Amplification factor for 1D convection problem, 3-stage
scheme, CFL = 1.5, coefficients: 0.1481, 0.4, 1.0, according to [59]

Fig. 35 Amplification factor for 1D convection problem, 5-stage
scheme with 3 evaluations of weighted dissipation and residual
smoothing, CFL = 5.0, P = 1.0, coefficients: 0.2742, 0.2067, 0.5020,
0.5142, 1.0

(a) One level, CFL = 5.0

(b) Two levels, full coarsening, CFL = 5.0

Fig. 36 Contour plots of amplification factor for 2D convection
problem and cell aspect ratio = 1. Five stage scheme with 3
evaluations of weighted dissipation and implicit residual smoothing

Fig. 37 Contour plots of the amplification factor for LU-SSOR
single-grid scheme with central discretization of the explicit
operator

Fig. 38 Contour plots of the amplification factor for LU-SSOR
single-grid scheme with second-order upwind spatial discretization of
the explicit operator

Fig. 39 Convergence histories for LU-SSOR scheme for inviscid flow
around NACA 0012 airfoil (M∞ = 0.8, α = 1.25°), numbers in
parentheses indicate the number of time steps on each grid starting
from the finest one

Fig. 40 Convergence histories for LU-SSOR scheme for viscous flow
over 15° compression ramp, numbers in parentheses indicate the number
of time steps on each grid starting from the finest one

Fig. 41 Two-level multigrid schemes

Fig. 48 Concept of fictitious points

Grid topology and block boundaries for incorporation of jet-generator

Mach number distribution in the symmetry plane at M∞ = 0.182,
α = 10°, P_tjet/P∞ = 2.0

View of grid symmetry-plane, jet-generator, and multi-element wing

Streamline pattern in the symmetry plane at M∞ = 0.182, α = 10°,
P_tjet/P∞ = 2.0

Fig. 59 Embedding of the local, polar mesh around jet-generator into
the global mesh around multi-element wing

Fig. 62 Enlargement of the streamline pattern in the symmetry-plane
for the region around the tab at M∞ = 0.182, α = 10°,
P_tjet/P∞ = 2.0

Fig. 63 Streamline pattern on the lower wing surface at M∞ = 0.182,
α = 10°, P_tjet/P∞ = 2.0

Fig. 65 Comparison of measured and calculated pressure distributions
in a plane half an engine diameter apart from the symmetry-plane at
M∞ = 0.147, α = 10°, P_tjet/P∞ = 1.252

Fig. 64 Comparison of measured and calculated pressure distributions
in the symmetry-plane at M∞ = 0.147, α = 10°, P_tjet/P∞ = 1.252

Fig. 66 Comparison of measured and calculated pressure distributions
in a plane one engine diameter apart from the symmetry-plane at
M∞ = 0.147, α = 10°, P_tjet/P∞ = 1.252

Fig. 68 Three-dimensional grid with selected grid planes

Fig. 69 H-C-O grid topology for consideration of viscous effects

Fig. 70 Embedded C-grid at pylon location

Fig. 84 Two-dimensional streamlines and Mach number
SUMMARY

This paper reviews some general considerations on the parallelization
of large block structured flow solvers for production use.
Parallelization is therefore not treated as an isolated subject of
research, but as a tool to increase the computational power for the
user and as an integral part of the developmental environment of a
CFD code. As an example, the parallelization of the FLOWer code using
the portable communications library CLIC-3D is given. Results of
benchmark tests obtained on various computer hardware architectures
demonstrate today's possibilities of parallel processing in CFD
applications.

LIST OF SYMBOLS

a      start-up time for communication
b      bandwidth for communication
c_p    specific heat at constant pressure
D      vector of artificial dissipative fluxes
E      total energy
E      speed-up
F      flux tensor
f      fraction of operations which cannot be performed concurrently
H      total enthalpy
k      heat transfer coefficient
N_B    number of blocks
N_P    number of processors
n      outward pointing unit normal vector
n      message length
P_rel  relative performance
Pr     Prandtl number
p      pressure
R      residual vector
S      speed-up
T      temperature
t      wall clock time
u      velocity vector
u      velocity in x-direction
V      volume
v      velocity in y-direction
W      vector of conservative variables
w      velocity in z-direction
γ      ratio of specific heats
μ      viscosity
ρ      density
σ      normal stress components
τ      shear stress components
Θ      components of the energy dissipation function

Indices

alg    algorithmic
comm   communication
i      inviscid
ijk    discrete point
l      laminar
R      reference
t      turbulent
v      viscous
x      in x-direction
y      in y-direction
z      in z-direction
∞      at infinity
from 15-19 May 1995 and 16-20 October 1995 at NASA Ames, United States and published in R-807.
1. INTRODUCTION

Reviewing the topics of computer applications of the last few years,
an increasing interest in parallel processing is observed, in CFD as
well as in other engineering disciplines, since experts predict that
the TFLOP/s computer expected by the end of the century will be a
parallel architecture [1, 2]. Therefore, parallel computing has become
a subject of basic research, carried out by mathematicians, computer
scientists, engineers and other scientists dealing with a large
variety of aspects. In the literature one can find benchmark results
for studying different hardware architectures [3], discussions on fast
communication protocols [4] or considerations on computer languages
supporting parallel processing, e.g. [5, 6, 7]. Others demonstrate
that they have parallelized their special application program and
that it is working reasonably well on different platforms, e.g.
[8, 9, 10].

The paper presented here will touch on all these areas, but not in
detail, because it is devoted to the major goal of all parallelization
efforts made in CFD: the increase of compute power, in order to either
reduce the response time for a given problem or to extend the problem
size to be solved.

It should be kept in mind that an engineer applying a large CFD-code
is in general not interested in details of the computer his program is
running on, but in details of the solution he can obtain, i.e. in the
aerodynamics of the problem he is investigating. Therefore, in this
paper parallelization is considered as a tool improving the
capabilities of numerical research in aerodynamics, not as a field of
research for its own sake.

From this point of view the question must be asked whether
parallelization is always useful and when it should be applied. The
answer is that the usefulness of parallelization depends on the
program to be dealt with. The improvement in run time to be obtained
by any acceleration technique can never exceed the run time currently
needed to solve a typical problem, and automatic parallelization is
only possible on those few machines where auto-parallelizing compilers
are available. Therefore, a certain amount of parallelization effort
has to be expended if one does not want to restrict oneself to a
special hardware environment, so that the gain is highest when
parallelizing programs for large applications.

Secondly, it is questionable to parallelize algorithms which guarantee
a high parallel efficiency but converge slowly. Such programs clearly
show an excellent acceleration by exploiting many CPUs, but probably
exhibit longer response times than sequentially running algorithms
which converge much faster.
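
The trade-off between parallel efficiency and convergence rate can be
illustrated with a small calculation. The following Python sketch
compares the response time of a slowly converging algorithm run in
parallel with that of a fast converging algorithm run sequentially.
The function name `response_time`, the simple model (iterations times
time per iteration, divided by processors times efficiency) and all
numbers are assumptions chosen for illustration only; they are not
taken from the paper.

```python
def response_time(iterations, time_per_iteration, processors, efficiency):
    """Wall clock time to convergence under a simple scaling assumption.

    The speed-up is modeled as processors * efficiency, which is a
    deliberately crude assumption for this illustration.
    """
    speedup = processors * efficiency
    return iterations * time_per_iteration / speedup


# Hypothetical scheme A: parallelizes almost perfectly, converges slowly.
slow_parallel = response_time(
    iterations=20000, time_per_iteration=1.0, processors=16, efficiency=0.95
)

# Hypothetical scheme B: runs on one processor, needs far fewer
# (but more expensive) iterations.
fast_sequential = response_time(
    iterations=500, time_per_iteration=2.0, processors=1, efficiency=1.0
)

print(f"slow but parallel  : {slow_parallel:8.1f} time units")
print(f"fast but sequential: {fast_sequential:8.1f} time units")
```

With these assumed numbers the sequential scheme still answers first,
although the parallel scheme shows a speed-up of about 15 on 16
processors.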

Therefore, the improvement due to parallelization will be greatest for
large CFD-codes that employ the most efficient numerical techniques,
and this paper will deal especially with this class of programs.
Furthermore, it is restricted to block structured codes, i.e. to
solvers which work on structured grids that are split into smaller,
interconnected subdomains which can be treated separately from each
other. As described previously [11], this is a standard technique
which allows computations of flow fields around complex geometries
for which, for mathematical reasons, no structured grid can be
generated as one logically rectangular block.

Such software usually is the historic product of many scientists
throughout a long period and is applied by a number of different
users, so that parallelization cannot be treated as an isolated
problem, but has to meet general requirements.

The next section identifies some of these requirements and discusses
their influence. Subsequently, strategies for the parallelization of
CFD-codes are dealt with, which depend on hardware and software
aspects of the computer as well as on the type of program. As an
example for the parallelization of a large structured flow solver, the
parallelization of the FLOWer code is described in the following
section. This program has evolved from the previously described DLR
standard flow solver CEVCATS [11] and is developed in cooperation with
the German national research center for computer science GMD and the
German aeronautical industry as a multi purpose flow solver. Benchmark
results obtained on a variety of different parallel computers
demonstrate the success of the approach chosen and the potential of
parallel processing in realistic applications.

2.  REQUIREMENTS  FOR  THE  PARALLELIZATION 
OF  LARGE  CFD-CODES 

2.1 Portability

As  already  mentioned  in  the  introduction,  large  CFD-codes 
are  applied  by  a  variety  of  users,  since  otherwise  the  costs 
for  their  development  could  not  be  accepted.  Of  course  it 
cannot  be  guaranteed  that  all  these  users  are  working  on  the 
same  platform,  neither  parallel  nor  sequential.  Moreover 
the  life  time  of  such  programs  exceeds  that  of  today's  com¬ 
puters  by  far,  so  that  portability  is  an  essential  demand  for 
any  application  program  in  industrial  use. 

For sequentially running codes this problem can be circumvented by
restricting the implementation to standardized languages for which
compilers exist on any machine, e.g. ANSI C or Fortran 77 (Fortran 90
is still problematic, since compilers do not exist for as many
computers as for Fortran 77). Furthermore, by rigid application of
programming standards it is possible to exclude dangerous programming
techniques which are allowed by the language standard, but which
might not work correctly on every target platform.

For parallel programs things are much more difficult. Of course, all
techniques for guaranteeing portability in sequential mode still
apply, but this is not sufficient, since the communication between
different processes has to be portable, too. Up to now, each
manufacturer of parallel computers employs his own proprietary
communication system, which is generally incompatible with those of
others. The MPI standard for message passing systems [12] was
established about one year ago, but implementations are still hardly
available, so that it does not yet guarantee portability.

In contrast, the PVM communication system [13] is widely spread, but
since it is public domain software it might be dangerous to base large
application programs on it. In case of severe problems nobody would be
responsible for troubleshooting, and applications are most often
urgent.

A  third  possibility  to  obtain  portability  as  far  as  message 
passing  systems  are  concerned  is  the  PARMACS  library 
[14]  which  is  a  commercial  product  that  has  been  imple¬ 
mented  on  a  large  variety  of  parallel  computers.  A  defined 
path  to  MPI  is  guaranteed,  when  this  system  has  become  a 
real  standard,  but  the  popularity  of  PARMACS  is  clearly 
restricted  to  European  users. 

Even if a decision has been made for one system or another, the
problem still remains that parallel computers might not be available
to every user, i.e. one should seek the possibility to run the same
program on sequential as well as on parallel computers.

2.2  Consideration  of  Development  Effort 

The development of large CFD-codes which are able to treat large
problems and complex flow situations takes a long time and requires
the experience of many scientists in order to establish an efficient,
accurate and robust solver. Furthermore, the users usually have been
working with those programs for a long time, too, so that they are
familiar with their behavior and experienced in the interpretation of
their numerical results.

Therefore,  parallelization  must  not  result  in  the  complete 
re-implementation  of  the  flow  solver,  but  is  restricted  to 
modifications  of  the  given  code,  as  far  as  large  application 
programs  are  considered. 

2.3  Parallelization  Effort 

As already pointed out, parallelization is only a means of high
performance computing, i.e. as with any other acceleration technique,
its efficiency decides its worth for the user. Unfortunately, any
larger gain in efficiency is only possible by increasing the
developmental effort. The latter is clearly restricted for economical
reasons, since the parallelization costs must not exceed the reduction
of computational costs for an institution or an industrial business as
a whole.


Therefore  a  parallelization  strategy  has  to  be  applied  guar¬ 
anteeing  sufficient  acceleration  with  as  little  effort  as  possi¬ 
ble. 

3.  PARALLELIZATION  STRATEGIES 

3.1  Parallel  Architectures  and  Parallelism  in  Structured 
Grid  Solvers 

Since expectations head towards some TFLOP/s peak performance by
parallel processing, a variety of different architectures has been
developed attempting to step further in this direction, but it is not
yet clear which design is going to succeed. Generally one
distinguishes two classes of parallel computers: shared memory
machines, where all CPUs are coupled by a common memory (Cray C90),
and distributed memory machines, where each processor has its own
memory unit. In the latter case the nodes are coupled by an
interconnecting network either between the CPUs (IBM SP2) or between
the memory units (KSR 1). Latest developments attempt to combine both
types by clustering together several processors around one shared
memory and connecting these clusters via a network (NEC SX-4).

Looking at the design of large structured grid solvers, they reveal
different levels of inherent parallelism to be exploited. First of
all, on statement level, operations could be performed concurrently,
e.g. one addition and one multiplication at a time on super scalar
processors. Secondly, the grid structure implies a parallelism of
data, such that operations on different grid points could be carried
out independently, which is already known from vector processors. Last
but not least, large structured grid solvers are multi block codes for
grid generation reasons. These blocks characterize the coarse grain
parallelism of the programs considered here, since the different
blocks could be treated concurrently.

Comparing machine architecture and code structure with each other, one
finds that different platforms fit different levels of inherent
parallelism. Fine grain parallelism on statement level is already
exploited by single processor machines, data parallelism seems to be
best suited for shared memory computers, whereas coarse grain
parallelism based on the block structure corresponds best with a
distributed memory architecture. Therefore, computers combining all
three features might be best suited for structured grid solvers, but
until they are available one has to investigate the possibilities of
exploiting data parallelism and multi block parallelism separately.
This leads to the question of how to perform communication between
processors.

3.2  Communication  Models 

According to the different machine architectures there exist different
types of communication models which support either data or multi block
parallelism. Nevertheless, these models are not restricted to the
corresponding computer architecture, and moreover their
implementations are generally incompatible with each other.

3.2.1  Parallelizing  Languages 

There exist attempts to describe data parallelism already by the
programming language, such as High Performance Fortran or Vienna
Fortran. However, these systems have not yet reached a widely accepted
standardization level, so that portability is hardly guaranteed for
the moment. This could be overcome by current developments
incorporating parallel communication within objects of existing object
oriented programming languages like C++ [5, 6, 7], but one major
drawback remains: any large solver not yet implemented in such a
language would have to be completely rewritten, which will clearly not
be acceptable for the reasons mentioned in the last section.

3.2.2  Compiler  Directives  and  Autotasking 

Another data parallel approach which makes parallelization more
feasible for the programmer is to use directives telling the compiler
which sections of the code can be treated concurrently, e.g. where
loops incorporate data parallel structures. This method has the great
advantage that an existing code basically remains unchanged and that
there exist analyzing tools, at least on some machines, making
suggestions about where to place such directives.

The problem is that this procedure has to be repeated on each platform
again, since compiler directives are naturally machine dependent.
Furthermore, experiments employing autoparallelizing compilers have
revealed that the best efficiencies were always achieved by putting in
these directives manually, which increases the parallelization effort
[15].

The autotasking approach assumes that only data incorporate
parallelism, i.e. only array data can be treated independently of each
other, so that good efficiencies can only be expected from highly
vectorizable programs. This assumption will generally hold for
structured grid solvers, but depends on the block size, which might be
low for grid generation reasons and which becomes definitely low on
coarse grids of multigrid algorithms. The advantage of this method is
that it is definitely portable, since parallelization is carried out
automatically.

On  virtual  shared  memory  machines,  i.e.  distributed  mem¬ 
ory  computers  which  are  programmed  as  if  they  had  a  glo¬ 
bal  shared  memory,  efficiency  decreases,  because  data 
have  to  be  transferred  by  global  communication. 

Last but not least, compiler directives are spread all over the code,
such that algorithmic development cannot be separated from the
parallel machine on which the code is running.


3.2.3  Message  Passing 

The typical communication model corresponding to coarse grain
parallelism is message passing, where the programmer himself is
responsible for all types of communication between the different
processes. This means the programmer must explicitly tell the program
when and where to send or receive data, which of course increases the
parallelization effort. The advantage of this type of communication
model is its efficiency, since data transfer takes place only when
needed. Moreover, all operations can be performed in parallel,
independent of vectorization features.

Of  course  portability  is  still  a  problem,  because  of  the  ven¬ 
dors  implementing  proprietary  systems,  but  as  pointed  out 
in  the  last  section,  there  already  exist  widely  spread  sys¬ 
tems  and  the  MPI-standard  allowing  an  acceptable  degree 
of  portability  today. 

In contrast to data parallel communication models, the message passing
technique can be treated independently from all algorithmic
considerations as far as single blocks are concerned. Each block is
treated the same way in the parallel mode as in the sequential mode,
and all communication takes place outside the block algorithm.
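
The explicit send and receive steps of the message passing model can
be sketched in a few lines of Python. In this minimal illustration two
threads stand in for two processes and `queue.Queue` objects stand in
for the message channels of a real message passing library; the names
`worker` and `run_two_blocks` and the toy data are assumptions made for
this sketch and do not correspond to any routine of PVM, PARMACS or
MPI.

```python
import queue
import threading


def worker(rank, block, inbox, outbox, results):
    # Explicit send: pass the boundary value facing the neighbor.
    boundary = block[-1] if rank == 0 else block[0]
    outbox.put(boundary)
    # Explicit receive: wait until the neighbor's boundary value arrives.
    neighbor_value = inbox.get(timeout=5.0)
    # Purely local computation that uses the received value.
    results[rank] = sum(block) + neighbor_value


def run_two_blocks():
    blocks = {0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]}
    # One mailbox per "process"; a message for rank r is put into mailbox[r].
    mailbox = {0: queue.Queue(), 1: queue.Queue()}
    results = {}
    threads = [
        threading.Thread(
            target=worker,
            args=(rank, blocks[rank], mailbox[rank], mailbox[1 - rank], results),
        )
        for rank in (0, 1)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


print(run_two_blocks())
```

Data are transferred only at the two marked places, and the local
computation itself contains no reference to the other block.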

3.3  Guidelines  for  the  Parallelization  of  Block  Structured 
Flow  Solvers 

In the following, four rules will be given and explained which have
proven to lead to an efficient parallelization while meeting the
objectives for large block structured flow solvers for industrial use.
Of course they should not be understood as the eternal laws of
parallelization, but they have been applied successfully for
parallelizing at least two solvers of this category, i.e. the FLOWer
code and the NSFLEX code [16].

3.3.1  Grid  Partitioning  as  Parallelization  Strategy 
This method is based on the idea of splitting a given grid into
smaller subdomains which can be treated independently of each other.
The arising intersections between the different blocks are treated as
boundaries with a special cut condition. In general there exists an
overlap region at those cuts, into which data are copied from the
corresponding neighboring block. As an example, figure 1 shows
schematically the partitioning of a two-dimensional domain around an
airfoil into four subdomains.

This technique is chosen since it is an approach of relative
simplicity. Furthermore, this strategy is agreed to be the most
efficient one [17, 18] when solving partial differential equations, as
is done by flow solvers. From a more practical point of view, this
method has the great advantage of being already well established in
sequential structured grid solvers, since the multi block technique is
nothing else but grid partitioning.

The main difference between a parallel and a sequential code then is
that the exchange of boundary data between neighboring blocks has to
be replaced by sending and receiving procedures. Another slight
difference concerns global operations involving all blocks, e.g. the
computation of the overall residual, which has to be realized by
global communication techniques. Therefore, applying grid partitioning
as the basic strategy is an approach that leads straightforwardly to
parallelization while keeping a sequentially proven algorithm largely
unchanged.
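
A minimal sketch may make the cut condition and the global operation
concrete. The following Python example, which is an illustration under
simplifying assumptions and not code from any of the solvers discussed,
applies Jacobi sweeps to a one-dimensional model problem once on a
single block and once on two blocks with a one-point overlap. The
change between two sweeps stands in for the residual; all function
names are hypothetical.

```python
def jacobi_sweep(u):
    """One Jacobi sweep; the first and last entries are held fixed."""
    interior = [0.5 * (u[i - 1] + u[i + 1]) for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def split_with_overlap(u, cut):
    """Split u into two blocks that each carry one overlap point."""
    left = list(u[:cut + 1])   # owns points 0..cut-1, overlap copy of point cut
    right = list(u[cut - 1:])  # overlap copy of point cut-1, owns cut..end
    return left, right


def exchange(left, right):
    """Cut condition: refresh each overlap point from the neighboring block."""
    left[-1] = right[1]
    right[0] = left[-2]


def change_sq(new, old):
    return sum((a - b) ** 2 for a, b in zip(new, old))


def single_block(u, sweeps):
    u, history = list(u), []
    for _ in range(sweeps):
        new = jacobi_sweep(u)
        history.append(change_sq(new, u) ** 0.5)
        u = new
    return u, history


def two_blocks(u, cut, sweeps):
    left, right = split_with_overlap(u, cut)
    history = []
    for _ in range(sweeps):
        new_left, new_right = jacobi_sweep(left), jacobi_sweep(right)
        # Each block contributes only the points it owns ...
        local = [change_sq(new_left[:-1], left[:-1]),
                 change_sq(new_right[1:], right[1:])]
        # ... and a global operation combines the local contributions.
        history.append(sum(local) ** 0.5)
        left, right = new_left, new_right
        exchange(left, right)
    return left[:-1] + right[1:], history


u0 = [0.0] * 8 + [1.0]
reference, ref_history = single_block(u0, 10)
partitioned, par_history = two_blocks(u0, 4, 10)
print(max(abs(a - b) for a, b in zip(reference, partitioned)))
```

In a parallel run the body of `exchange` would become a pair of send
and receive calls and the sum over `local` a global reduction, while
`jacobi_sweep` would stay untouched.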
Fig. 1 Schematic multi block decomposition of the flow field around a
generic transport aircraft.

3.3.2  Separation  of  Computation  and  Communication 
Only strict adherence to this rule will allow the development of
algorithms independently from hardware aspects. This feature is
necessary with respect to the conditions under which large block
structured codes are usually developed. There are several scientists,
engineers or programmers working on the same software, and one cannot
assume that all of them are sharing the same parallel supercomputer
for development purposes, i.e. for testing and debugging instead of
high performance computing. Separating all communication operations
from the algorithmic parts therefore allows the integration of
developments carried out on simple workstations without problems.

Furthermore, for software engineering reasons a high degree of
modularity of the program design must be aimed at, which enables a
coordinated development by a group of researchers. Any intermixing of
communication and computation would therefore contradict this basic
principle of software realization.

Last but not least, the portability problem becomes much easier to
handle when all communication procedures are concentrated within
separate units of the program. Even if communication systems are not
compatible with each other, the effort for porting a program to
another parallel platform is reduced, since only defined modules have
to be modified or exchanged.

3.3.3  Communication  by  Message  Passing

The decision for the message passing programming model follows
quite naturally from what has been said above. As has been shown,
this type of communication corresponds to coarse grain parallelism,
and that is exactly what the grid partitioning strategy or multi
block technique represents.

Additionally, one gets the highest efficiency, since parallelism is
not restricted to the vectorizable parts of the code. One should
never forget that high performance computing is what
parallelization aims at. Another advantage is what programmers
might fear as an increase of implementation effort: communication
has to be realized by explicit calls of system routines for sending
and receiving data and so on. The message passing routines
therefore already form a kind of library which exists independently
of any application program, so that separating communication from
computation becomes a simple task. One only has to concentrate all
these routine calls within distinct modules of the program.

Finally, the application of message passing does not exclude data
parallel programming models as far as compiler directives are
concerned. Since message passing only touches the block structure
of a flow solver, the inherent data parallelism within each block
remains. A combination of techniques involving message passing for
the inter block communication and data parallel directives within
each block is therefore conceivable, especially with respect to
future multi level architectures. Nevertheless, the drawbacks and
advantages of such an approach would have to be assessed once
practical experience has been gained.

3.3.4  Use  of  a  Communication  Library 

Returning to pure message passing and what has been said about its
features, it is only one step further to demand a library realizing
all necessary communication in a parallel code. In view of the
preceding subsections this might seem to be merely a more detailed
guideline concluding what has already been said, but it is more
than that.

What is envisaged is a high level library incorporating the whole
functionality involving communication in block structured programs,
e.g. an exchange of boundary data at block interfaces. Since all
these functions must already have been realized in sequential mode,
ideally within separate modules, portability between sequential and
parallel computers is no longer a problem. One only has to either
link different libraries or call different subroutines depending on
the architecture.
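The idea of exchanging the communication layer without touching the solver can be sketched as follows. This is only an illustration in Python, not the interface of any real library: the class names `Comm`, `SequentialComm` and `SimulatedParallelComm` and the function `overall_residual` are hypothetical, and the "parallel" variant merely imitates per-node partial sums followed by a global reduction.

```python
import math
from abc import ABC, abstractmethod


class Comm(ABC):
    """Hypothetical communication interface as seen by the flow solver."""

    @abstractmethod
    def global_sum(self, per_block_values):
        """Return the sum of one value per block over all blocks."""


class SequentialComm(Comm):
    """Sequential version: a plain loop over all blocks."""

    def global_sum(self, per_block_values):
        return sum(per_block_values.values())


class SimulatedParallelComm(Comm):
    """Imitates a parallel run: partial sums per node, then a reduction."""

    def __init__(self, block_to_node):
        self.block_to_node = block_to_node

    def global_sum(self, per_block_values):
        partial = {}
        for block, value in per_block_values.items():
            node = self.block_to_node[block]
            partial[node] = partial.get(node, 0.0) + value
        return sum(partial.values())


def overall_residual(comm, squared_residuals):
    """Solver-side code: identical for both versions of the library."""
    return math.sqrt(comm.global_sum(squared_residuals))
```

The solver routine `overall_residual` only knows the interface, so switching between the two versions corresponds to linking a different library in the sense described above.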

Additionally, such a library can be developed almost completely
independently of the calling CFD solver, so that specialists in
parallel computing can be employed for its implementation, ensuring
a high degree of reliability. The application programmer, on the
other hand, is relieved of any basic considerations of parallelism.
He only has to be familiar with the interfaces to the library
routines, the functionality of which he already knows from his
sequential experience.

Therefore, although the effort of realizing such a library is high,
the parallelization costs for the application program are low, and,
since a library can be re-used again and again by different codes,
its implementation is worthwhile. This approach is not a distant
vision for the future but has already become reality, and it is
described in the next section.

4.  THE  COMMUNICATIONS  LIBRARY  CLIC-3D 
At GMD this approach has been followed with the creation of the GMD
communications library CLIC ("Communications Library for Industrial
Codes"; former versions are known as the GMD Comlib). The target
applications are PDE solvers on regular and block-structured grids,
as they result from finite difference or finite volume
discretizations. In particular, the library supports parallel
multigrid applications. For this class of applications it turned
out that, while the numerics differ widely, the communication
sections are quite similar in many programs, depending only on the
underlying problem geometry. As a consequence of the high level of
abstraction, the CLIC library is useful only for the application
class for which it was designed.

The development of the CLIC library started at GMD in 1986 with the
definition and implementation of routines for 2- and 3-dimensional
logically rectangular grids. This was followed by the
implementation of routines for 2-dimensional block-structured
grids. The routines for 3-dimensional block-structured grids are
currently being developed in the project POPINDA. The routines
support vertex-oriented as well as cell-centered discretization
schemes.

POPINDA is a German national project, funded by the German Federal
Ministry for Education, Science, Research and Technology (BMBF).
Its central goal is to make highly parallel systems usable for
aerodynamic production codes. The parallel codes being developed in
the project are based on highly efficient numerical algorithms
(multigrid). They will allow more accurate simulations, which are
indispensable due to increased economic, ecological and technical
requirements.

The aim in the development of CLIC is to make programming for
complex geometries as easy as for a single cube and to provide high
level library routines for all communication tasks. The CLIC user
interface provides the application program with all required
information about the problem geometry.

The CLIC library is based on the PARMACS message passing system
[14] and, thus, is designed for a host-node (master-slave) model. A
host process starts the distributed application and performs the
input, the output and the data transfers with the node processes.
The host process does not participate in the grid computations;
these are performed by the node processes. As a consequence, the
user application is separated into a host program and a node
program, as illustrated by figure 2.


Fig.  2  Host-node-structure  of  the  parallel  FLOWer  code. 


In the host program of a 3-dimensional block-structured
application, the same input parameters are read as in the
sequential user program. Then, CLIC-routines read in the
description of the block-structured grid, create the node
processes, distribute the blocks in a load-balanced way to the
allocated node processors and distribute the input parameters to
the node processes. Another routine reads the grid coordinates and
sends them to the corresponding node processes. After the data has
been distributed to the node processes, the host program usually
calls a CLIC-routine which waits for output generated by the node
processes and writes that output to the corresponding output units.
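How the blocks are balanced over the nodes is not specified here. As one plausible illustration, the sketch below uses a simple greedy rule (largest block first, always onto the least loaded node); this is an assumption for demonstration and not necessarily what CLIC does. The function names `distribute_blocks` and `node_loads` are hypothetical.

```python
import heapq


def distribute_blocks(block_cells, n_nodes):
    """Greedy mapping: assign blocks, largest first, to the node with
    the smallest load so far. `block_cells` maps block name -> cells."""
    heap = [(0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    mapping = {}
    ordered = sorted(block_cells.items(), key=lambda kv: (-kv[1], kv[0]))
    for block, cells in ordered:
        load, node = heapq.heappop(heap)      # least loaded node
        mapping[block] = node
        heapq.heappush(heap, (load + cells, node))
    return mapping


def node_loads(block_cells, mapping, n_nodes):
    """Total number of cells assigned to each node."""
    loads = [0] * n_nodes
    for block, node in mapping.items():
        loads[node] += block_cells[block]
    return loads
```

For blocks of equal size and as many nodes as blocks, this rule reduces to one block per node.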

Each node process executes the node program, which is very similar
to the sequential user program except that it does not read the
input data. The input data is transferred by CLIC-routines, which
receive the essential information on the blocks together with
global information passed by the host program. The grid coordinates
are also received by a library routine. It should be noted that a
node process receives the information and grid coordinates only for
the blocks for which it performs grid computations. A schematic
flow chart of the host and node process cooperation is given in
figure 3.


Fig. 3 Schematic flow chart of host and node process execution
supported by the CLIC communications library (columns: HOST,
NODE 1, NODE 2).

Library routines also analyze the block-structure; i.e. for each
segment edge and segment point, the adjoining blocks and the number
of coinciding grid cells are determined and the edge or point is
topologically classified. If the segment edge or point is part of
the physical boundary, the physical boundary conditions of all
adjoining blocks are also determined. In addition, the grid
coordinates can be examined, and geometrical singularities such as
block faces which collapse to a single point can be detected. All
this data can be queried and may be used in the user program, for
example in the discretization at irregular grid points or physical
boundary points.

This data may be important for the user program; for the CLIC
library, however, it is essential in order to correctly update the
overlap regions (i.e. to exchange the boundary data) of neighboring
blocks and to optimize this update procedure. Optimizing the update
procedure is significant for the parallel efficiency, because the
corresponding CLIC-routine is generally the one called most often
and is the most critical routine, especially on the coarse grids of
multigrid algorithms. An example of such an optimization concerns
regular corners of 8 blocks. A straightforward technique to update
the overlap regions is to send and receive messages over all faces,
edges and corners of a block; thus, 26 messages (6 faces, 12 edges,
8 corners) have to be sent and received for each block in such a
regular case. However, in such regular cases, the number of
messages needed to update the overlap regions can be decreased to 6
by the following technique: in the first step, all blocks exchange
their data with neighbor blocks in I-direction (1 message per block
face); in the second and third step, all blocks exchange their data
with neighbor blocks in J- and K-direction, respectively, but now
including the already updated overlap regions. This technique is
illustrated by figure 4 for a two-dimensional example.
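The two-dimensional case of this technique can be reproduced with a small pure-Python experiment. The sketch below is not CLIC code; the data layout (lists with one overlap layer) and all function names are assumptions chosen for illustration. It splits a field into 3 x 3 blocks and performs the two-step update, so that the central block sends 4 messages instead of 8 (in three dimensions: 6 instead of 3^3 - 1 = 26) and still receives correct corner values.

```python
def field(i, j):
    """Global test field; any function of the global indices would do."""
    return 1000 * i + j


def make_blocks(nb, n):
    """Split an (nb*n) x (nb*n) field into nb x nb blocks of n x n cells,
    each stored with one layer of overlap cells (initially None)."""
    blocks = {}
    for bi in range(nb):
        for bj in range(nb):
            loc = [[None] * (n + 2) for _ in range(n + 2)]
            for i in range(1, n + 1):
                for j in range(1, n + 1):
                    loc[i][j] = field(bi * n + i - 1, bj * n + j - 1)
            blocks[(bi, bj)] = loc
    return blocks


def update_overlap(blocks, n):
    """Two-step update; returns the number of messages sent per block."""
    sent = {key: 0 for key in blocks}
    # Step 1: exchange in I-direction, interior cells of the face only.
    for (bi, bj), loc in blocks.items():
        for di, src, dst in ((1, n, 0), (-1, 1, n + 1)):
            nbr = blocks.get((bi + di, bj))
            if nbr is not None:
                nbr[dst][1:n + 1] = loc[src][1:n + 1]
                sent[(bi, bj)] += 1
    # Step 2: exchange in J-direction, now including the overlap rows
    # filled in step 1, which carries the corner values along.
    for (bi, bj), loc in blocks.items():
        for dj, src, dst in ((1, n, 0), (-1, 1, n + 1)):
            nbr = blocks.get((bi, bj + dj))
            if nbr is not None:
                for i in range(n + 2):
                    nbr[i][dst] = loc[i][src]
                sent[(bi, bj)] += 1
    return sent


def overlap_is_consistent(blocks, key, n):
    """Check every cell of a block, overlap included, against the field."""
    bi, bj = key
    loc = blocks[key]
    return all(loc[i][j] == field(bi * n + i - 1, bj * n + j - 1)
               for i in range(n + 2) for j in range(n + 2))
```

Calling `update_overlap(make_blocks(3, 2), 2)` and then checking the central block `(1, 1)` with `overlap_is_consistent` shows that the corner cells arrive via the second step without any diagonal message.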
Fig. 4 CLIC exchange strategy.

Thus, the data resulting from the analysis of the block-structure
is used to optimize the update of the overlap regions. Since it
would be too expensive to optimize the update sequence and to
determine the areas which have to be sent to neighbor blocks anew
within each update, these tasks are performed only once by
CLIC-routines in an initialization routine of the user program.
Within the solution process of the user program, the update of the
overlap regions of all blocks is then performed by the call of a
single CLIC-routine. In that call, the user specifies the number of
the multigrid levels and can choose the number of grid functions to
be exchanged simultaneously.

Among other tasks, the CLIC library also performs the computation
of global values (for example global residuals) and the output to
files and standard output which is generated by the node processes.
In the next year, the library will be extended to adaptive
block-structures (i.e. hierarchies of block-structures). This will
include routines which create and manage adaptively refined new
grid levels, which perform a load-balanced dynamic mapping and
which perform all data re-distribution required during adaptive
multigrid algorithms.

An  important  fact  for  the  development  and  management  of 
user  programs  is  that  there  is  also  a  sequential  version  of 
the  3-dimensional  block-structured  CLIC  library.  Thus,  a 
user  program  can  be  sequentially  executed  with  the  same 
interfaces  as  in  the  parallel  case. 

5.  PARALLELIZATION  OF  THE  FLOWer  CODE
The development of the FLOWer code was initiated within the
parallelization project POPINDA. The program has directly evolved
from the DLR-CEVCATS code [11] and is being developed further in
close cooperation between DLR and the German aerospace industry,
i.e. DASA.

Like the DLR-CEVCATS code, the FLOWer code is written in standard
Fortran 77 for portability reasons and operates on block structured
grids. It therefore allows computations of flows around complex
aircraft geometries as illustrated by figure 5. Furthermore, effort
is being made to push all FLOWer development towards the design of
a multi purpose standard code for a wide range of complex
applications. Since many different departments of various research
institutions and industry are involved, the future FLOWer code must
cover all of their aerodynamic problems, ranging from
incompressible flows to hypersonics.


Fig.  5  Generic  aircraft  configuration  consisting  of  wing, 
body,  engine  and  pylon 


5.1  Numerical  Method 

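The FLOWer code solves the Euler or Navier-Stokes equations in
conservative form, written as

$$\frac{\partial}{\partial t} \int_V W \, dV + \int_{\partial V} F \cdot n \, dS = 0$$

where W denotes the vector of conservative variables

$$W = (\rho, \; \rho u, \; \rho v, \; \rho w, \; \rho E)^T$$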

and F is the flux tensor which can be split into an inviscid
and a viscous part:

$$F = F^i + F^v$$

with the inviscid flux tensor being defined by

and the viscous flux tensor by

The components of the energy dissipation function are:

$$\Phi_x = u \sigma_{xx} + v \sigma_{xy} + w \sigma_{xz} + k \frac{\partial T}{\partial x}$$

$$\Phi_y = u \sigma_{xy} + v \sigma_{yy} + w \sigma_{yz} + k \frac{\partial T}{\partial y}$$

$$\Phi_z = u \sigma_{xz} + v \sigma_{yz} + w \sigma_{zz} + k \frac{\partial T}{\partial z}$$


For the non-dimensional pressure and temperature the following
relations hold

$$p = \rho (\gamma - 1) \left( E - \frac{u^2 + v^2 + w^2}{2} \right)$$


The elements of the viscous stress tensor are given by
Newton's law of skin friction, i.e.

This formulation is further simplified by applying a thin shear
layer approximation such that gradients in streamwise direction,
i.e. along quasi streamwise grid lines, are neglected [19].

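The system is closed by the relations for the transport
coefficients

$$\mu = \mu_l + \mu_t$$

$$k = c_p \left( \frac{\mu_l}{Pr_l} + \frac{\mu_t}{Pr_t} \right)$$

where the laminar viscosity $\mu_l$ is given by Sutherland's
formula

$$\frac{\mu_l}{\mu_{l,\infty}} = \left( \frac{T}{T_\infty} \right)^{3/2} \frac{T_\infty + 110\,\mathrm{K}}{T + 110\,\mathrm{K}}$$

and the turbulent viscosity is computed from the algebraic
Baldwin-Lomax model [20].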
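As a small numerical illustration of Sutherland's formula, the following sketch evaluates the laminar viscosity relative to its free stream value. The function name `sutherland_ratio` is hypothetical, the temperatures are dimensional values in Kelvin, and the normalization by the free stream viscosity is an assumption about how the formula is scaled; the constant of 110 K is the one appearing in the text.

```python
def sutherland_ratio(T, T_inf, S=110.0):
    """Laminar viscosity divided by its value at T_inf (Sutherland's law).

    T and T_inf are temperatures in Kelvin, S is the Sutherland constant.
    """
    return (T / T_inf) ** 1.5 * (T_inf + S) / (T + S)
```

For example, `sutherland_ratio(400.0, 300.0)` gives a viscosity roughly 24 % above the free stream value, reflecting that the viscosity of a gas grows with temperature.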

These equations are discretized in space by the method of lines,
resulting in a system of ordinary differential equations for each
hexahedron of the structured grid

$$\frac{d}{dt} W_{ijk} + \frac{1}{V_{ijk}} \int_{\partial V_{ijk}} F \cdot n \, dS = 0$$


The discretization is central, but it can be switched between a
cell vertex and a node centered scheme (figure 6).

Fig. 6 Discretization stars of the FLOWer code (panels: cell
vertex, node centered).

Because the central scheme is not dissipative by itself, an
artificial dissipation term due to Jameson et al. [21] is added,
damping high frequency oscillations and allowing a sufficiently
sharp resolution of shock waves in the flow field.

The resulting system of equations then reads

$$\frac{d}{dt} W_{ijk} + \frac{1}{V_{ijk}} \left( R_{ijk} - D_{ijk} \right) = 0$$

with $R_{ijk}$ being the vector of the residuals of convective and
viscous fluxes and $D_{ijk}$ the vector of the artificial
dissipative fluxes.

The time integration is carried out by an explicit, hybrid
Runge-Kutta scheme involving multiple stages [22]. The convergence
to steady state is further accelerated by the techniques of local
time stepping and an implicit smoothing of the residuals obtained
within a Runge-Kutta stage. For Euler computations the solution can
be driven to steady state faster by exploiting the requirement of
constant enthalpy [23].
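The structure of such a multistage scheme can be sketched in a few lines. The version below is a generic low-storage form in which every stage restarts from the solution at the beginning of the time step; the stage coefficients are a commonly quoted five-stage set and are an assumption here, since the coefficients used in FLOWer are not given. The hybrid treatment of the dissipative fluxes, local time stepping and residual smoothing are deliberately left out.

```python
def multistage_step(w0, residual, dt, alphas=(0.25, 1 / 6, 0.375, 0.5, 1.0)):
    """One explicit multistage step for dw/dt = -residual(w).

    Each stage is evaluated as w = w0 - alpha * dt * residual(w_previous),
    so only w0 and the current stage value have to be stored.
    """
    w = w0
    for alpha in alphas:
        w = w0 - alpha * dt * residual(w)
    return w
```

Applied to the scalar model problem dw/dt = -w with `multistage_step(1.0, lambda w: w, 0.1)`, the result stays close to the exact decay exp(-0.1).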

Alternatively, a two stage implicit LU-scheme has been implemented
only recently and is currently being tested for the final
integration.

Both iteration techniques are embedded into a powerful multigrid
algorithm [24]. Depending on the user input data, standard single
grid computations are possible as well as successive grid
refinement, simple multigrid or full multigrid algorithms. As
illustrated in [11], high convergence rates can be obtained using
this technique.

A more detailed description of the algorithms used can again be
found in Kroll et al. [11].

5.2  Block  Structure 

Since grids around complex geometries cannot be generated as one
logically rectangular block, the FLOWer code is block structured.
That means that the domain is split into regions for each of which
the generation of a structured grid is possible. Figure 7
schematically shows such a grid around a transport aircraft. The
program then treats the blocks more or less independently of each
other, which can only be done properly by exchanging data of the
current solution at block interfaces before each time step.


Fig.  7  Schematic  multiblock  decomposition  of  the  flow 
field  around  a  generic  transport  aircraft. 

Therefore, each block is surrounded by one or two layers of dummy
cells which are used for the formulation of boundary conditions. At
block intersections these cells correspond to those of the
neighboring block and carry the solution of the points there. This
technique has already been illustrated by figure 1 where a 2D
example of the block structure around an airfoil is given involving
one dummy layer.

The overlap width at such intersections determines the order of
accuracy that can be obtained at boundaries. The FLOWer code
therefore allows two dummy layers on demand by the user, in order
to keep the accuracy at block intersections unchanged at second
order in space.

This number of dummy layers is necessary for computing the
artificial dissipation terms at cuts correctly. Since these involve
central fourth differences in space, each grid point needs a
support of two further vertices on either side. Therefore, grid
points located on the intersection of two blocks need data of two
layers of points from their neighbor, i.e. two dummy layers, in
order to compute the artificial dissipation there exactly as if
there were no cut.
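The stencil argument can be checked directly in one dimension. The sketch below is illustrative only (the names `fourth_difference` and `split_with_ghosts` are hypothetical, and points are simply split between a left and a right block): with two dummy layers the fourth difference next to the cut equals the single block value, whereas one layer does not provide the required support.

```python
def fourth_difference(w, i):
    """Central fourth difference; needs two neighbors on either side."""
    return w[i - 2] - 4 * w[i - 1] + 6 * w[i] - 4 * w[i + 1] + w[i + 2]


def split_with_ghosts(w, cut, layers):
    """Split a 1D array at index `cut` into two blocks, each extended by
    `layers` dummy points copied from the neighboring block."""
    left = w[:cut + layers]     # own points 0..cut-1 plus dummy points
    right = w[cut - layers:]    # dummy points plus own points cut..end
    return left, right
```

With `left, right = split_with_ghosts(w, cut, 2)`, evaluating `fourth_difference(left, cut - 1)` reproduces the single block result at the last point owned by the left block; with `layers=1` the same call runs past the end of the block data.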

That inaccuracies at block interfaces may influence the solution is
demonstrated by figures 8 and 9, where the pitching moment of an
oscillating NACA 0012 airfoil is plotted versus the angle of attack
[29]. When only one layer of dummy cells is used, the multiblock
solution deviates from that obtained by a single block computation.
When second order accuracy at the cuts is re-established by adding
the second dummy layer, these differences vanish.



Fig.  8  Pitching  moment  of  an  oscillating  airfoil  versus 
angle  of  attack.  Comparison  of  single  block  and 
multiblock  solution  with  one  dummy  layer  at  cuts 
[29]. 


Fig.  9  Pitching  moment  of  an  oscillating  airfoil  versus 
angle  of  attack.  Comparison  of  single  block  and 
multiblock  solution  with  two  dummy  layers  at  cuts 
[29]. 

Since exchange of data between blocks creates an additional
overhead with respect to single block computations, the FLOWer code
offers different strategies for this procedure, varying in the
frequency of exchange during one iteration [25]. They are sketched
in figure 10.

The first approach performs a complete exchange at block boundaries
before each Runge-Kutta stage and before the computation of the
residuals for the forcing functions of the multigrid algorithm.




The second possibility is to update the block interfaces before
each complete Runge-Kutta iteration step and again before computing
the residuals for the forcing functions on the coarse grid.

Finally, a third strategy carries out the data exchange only once
per grid level before the Runge-Kutta iteration step. The three
techniques differ in their convergence behavior and in the memory
needed, because the less exchange is performed, the less data has
to be stored intermediately.

Fig. 10 Strategies for exchange of data at block interfaces.


The complete procedure is realized as an in-core solver as well as
an out-of-core solver which keeps all block data on an external
storage device. Therefore, large problems whose total size exceeds
the main memory capacity can be solved.

5.3  Parallelization  of  the  FLOWer  code
As already pointed out, the parallelization of the FLOWer code
followed the guidelines explained above. Parallelization therefore
meant integrating calls of the CLIC-library at distinct locations
within the code. This leads to the structure of software layers
sketched in figure 11, which illustrates how parallelization and
portability are achieved at once.

Since the CLIC-library is based on the PARMACS message passing
interface, two different programs are necessary for a parallel run,
called host and node program (figure 2). This feature was used to
make the FLOWer code applicable on parallel as well as on
sequential computers:


Fig. 11 Software layers of the parallelized FLOWer code.

The host program is only needed in parallel mode and performs the
I/O operations. It creates and starts the node processes and
distributes the initial data, i.e. grid coordinates and global
control data, accordingly. During the program run the host process
receives the convergence information from all nodes and prints it
to the standard output. At the end it collects the solution data
from the nodes and writes them to the specified output files.

All parallel output operations performed by the host process are
completely hidden from the user, since they are driven by the node
processes. Only one call of the CLIC-library is necessary in order
to initiate the communication between the host and the nodes; the
rest is carried out automatically.

The node program contains the complete sequential flow solver. Only
one parameter has to be specified by the user, which switches
between routines of the CLIC-library and standard sequential
procedures. The parallel mode therefore differs from the sequential
mode essentially in only four points:

•  Input  read  operations  are  replaced  by  reception  of  input 
data  from  the  host  process. 

•  Global  operations  involving  all  blocks  of  the  given 
block  structure  are  performed  by  the  CLIC  instead  of 
within  do  loops  over  all  blocks. 

•  The  exchange  of  data  at  block  interfaces  is  carried  out 
fully  automatically  within  distinct  CLIC-routines. 
There  is  no  intermediate  storing  or  reordering  of  data 
necessary  any  more. 

•  Write statements for putting out data are replaced by
parallel write operations of the CLIC consisting of an
initialization, an output format, output data and a
termination procedure.

These differences are all only slight additions to the sequential
program, so that the advantages of the guidelines described above
become quite clear:

•  The program is fully portable, since the FLOWer code
and the CLIC-library are each portable in themselves.

•  None of the modifications touch the numerical algorithm,
so that users and developers can keep their well known
environment.

•  The parallelization effort is extremely low, provided that
the communications library CLIC already exists.

6.  RESULTS 

After integrating the CLIC calls into the FLOWer program structure,
several test computations and benchmarks were carried out in order
to demonstrate the success of the approach chosen and to
investigate the potential of parallelizing a real application code.
These are reported in the following.

6.1  Test  Cases 

Two different test cases were defined for comprehensive studies of
the performance of computers and networks. They represent typical
problems in aerodynamics while still remaining simple.

The first problem to be solved was the inviscid flow around a
non-swept wing consisting of NACA 0012 airfoils at a free stream
Mach number of M = 0.6 and an incidence of α = 0°. For this case
two different grids were generated, a coarse one consisting of
160 x 32 x 8 cells and a fine one consisting of 320 x 64 x 16
cells. Both grids were mainly subdivided into 1, 4 and 8 blocks of
equal size as shown in figure 12. This subdivision was carried
further for the fine grid, giving a 16 block and a 32 block case.
Each computation consisted of 100 multigrid W-cycles involving
three mesh levels, where the wall clock time was measured between
the start of the initialization of the solution and the end of the
iterations.

The second test case was the DLR-F4 wing-body combination, a
generic, Airbus-like aircraft shown in figures 13 and 14. Here, the
inviscid flow was computed at a free stream Mach number of M = 0.75
at α = 0° incidence. The C-grid consists of 256 x 40 x 40 cells and
is blocked along the C-lines into 1, 4 and 8 blocks of equal size.
Each computation consisted of 35 multigrid W-cycles involving four
mesh levels. The time measurement was carried out as described
above.

In  both  cases  each  block  was  mapped  to  one  processor. 
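To get a feeling for the work per processor in these test cases, the number of cells per block can be computed from the grid sizes given above. The helper `cells_per_block` is hypothetical, and the coarse level estimate assumes that each coarser multigrid level halves the number of cells in every direction, which is the usual choice but is not stated in the text.

```python
def cells_per_block(dims, n_blocks, level=1):
    """Cells owned by one block on a given multigrid level (1 = finest)."""
    ni, nj, nk = dims
    total = ni * nj * nk
    per_block, remainder = divmod(total, n_blocks)
    if remainder:
        raise ValueError("grid cannot be split into equal blocks")
    # Assumption: every coarser level has 2 x 2 x 2 = 8 times fewer cells.
    return per_block // 8 ** (level - 1)
```

For the fine wing grid split into 32 blocks this gives 10240 cells per processor on the finest level, but under the stated assumption only 160 on the third level, which indicates why communication matters most on the coarse grids.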


Fig.  12  Block  structure  for  the  NACA  0012  wing  test 
cases 


Fig.  13  Block  structure  of  the  DLR-F4  wing-body 
combination. 


Fig. 14 Iso-Mach-contours on the DLR-F4 wing-body
combination, M = 0.75, α = 0°.

6.2  Methods  of  Performance  Assessment
Different quantities have been evaluated, especially for the NACA
0012 test case, in order to assess various parallel and sequential
computers. This procedure is necessary if one wants to obtain
information on the real performance of a computer, since a
restriction to only one criterion could give a wrong impression of
a computer's abilities. Therefore, some characteristic values are
defined in the following.
6.2.1  Speed-up 

The speed-up gives the acceleration obtained by employing several
CPUs for a problem of a given size. It is defined as

$$S = \frac{t(N_P = 1)}{t(N_P)}$$

which is the ratio of wall clock times needed by one processor and
by $N_P$ processors. This is usually compared with the number of
processors used, which is called linear speed-up. The true speed-up
always deviates from the linear one, because employing several CPUs
always creates an overhead for communication.

Since the blocking creates an additional overhead in the FLOWer
code, because block interfaces are computed multiply (once per
block), an algorithmic speed-up is defined as

$$S_{alg} = N_B \, \frac{t(N_B = 1)}{t(N_B)}$$

which is the ratio of computing times for the one block case and
the $N_B$ block case on a single processor, multiplied by the
number of processors which could be employed, i.e. the number of
blocks. This value gives the algorithmically possible speed-up for
a given problem.
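The two speed-up definitions translate directly into code. The function names below are hypothetical and the timings in the usage note are invented purely for illustration; they are not measurements from this paper.

```python
def speedup(t_one_processor, t_np_processors):
    """Ratio of wall clock times on one processor and on N_P processors."""
    return t_one_processor / t_np_processors


def algorithmic_speedup(t_one_block, t_nb_blocks, n_blocks):
    """Single processor time ratio of the one block case to the N_B block
    case, multiplied by the number of blocks (usable processors)."""
    return n_blocks * t_one_block / t_nb_blocks
```

With made-up single processor timings of 100 s for one block and 125 s for the same grid cut into 8 blocks, `algorithmic_speedup(100.0, 125.0, 8)` yields 6.4, i.e. the blocking alone limits the attainable speed-up on 8 processors to 6.4 rather than 8.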

6.2.2  Efficiency 

The efficiency of a parallelization denotes the degree to which the
theoretically linear speed-up is reached, i.e.

$$E = \frac{S}{N_P}$$

where S is the speed-up and $N_P$ the number of processors. Because
of the algorithmic overhead to be expected from the blocking, an
algorithmic efficiency can be defined by

$$E_{alg} = \frac{S}{S_{alg}}$$

where $S_{alg}$ is the algorithmic speed-up, showing the degree to
which the parallel code reaches the algorithmically possible
speed-up. Therefore, the standard efficiency is a global indicator
of the degree to which a parallel code exploits a given machine,
whereas the algorithmic efficiency characterizes the quality of the
parallelization itself.

6.2.3  Relative  Performance 

Since speed-up and efficiency are both related to measurements on
the same computer, a comparison between different machines is
mandatory for a true assessment of a parallel program. Therefore,
one can define a relative performance

$$P_{rel} = \frac{t_{ref}}{t}$$

which is the ratio of the computing time on a reference processor
(usually a Cray C90) and the time needed on the benchmark machine.
This value allows comparisons even between parallel and sequential
computers, provided that the program runs on all platforms.
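The efficiency measures and the relative performance can be evaluated together from a handful of timings. The following sketch uses hypothetical function names and takes the algorithmic efficiency as the ratio of the measured to the algorithmically possible speed-up, which is how the verbal definition above is read here; all numbers in the usage note are invented.

```python
def efficiency(speed_up, n_processors):
    """Fraction of the linear speed-up that is actually reached."""
    return speed_up / n_processors


def algorithmic_efficiency(speed_up, alg_speed_up):
    """Fraction of the algorithmically possible speed-up that is reached."""
    return speed_up / alg_speed_up


def relative_performance(t_reference, t_benchmark):
    """Time on the reference processor divided by the time on the
    benchmark machine; values above 1 mean the benchmark is faster."""
    return t_reference / t_benchmark
```

For instance, a measured speed-up of 6.0 on 8 processors with an algorithmic speed-up of 6.4 gives `efficiency(6.0, 8)` = 0.75 but `algorithmic_efficiency(6.0, 6.4)` of about 0.94, separating the loss caused by the blocking from the loss caused by the parallelization itself.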

6.2.4  Concluding  Remarks 

Assessing computers using the above definitions is still
problematic and has to be done with great caution.
Speed-up and efficiency as isolated values do not say
anything about the quality of a computer or a program, since
they carry no information about the absolute time needed.
For example, figure 15 shows the speed-up obtained for a
two-dimensional computation of the flow around a NACA
0012 airfoil. On 160 blocks a speed-up of 125 was reached
on a Parsytec GCel. This is a high value, but the computing
time still exceeded the time obtained on only four
processors of an IBM SP1 [28].
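
The caveat can be made concrete with a small calculation.
The sketch below uses invented single-node times, chosen only
so that the speed-ups resemble the situation described; none
of the times are measurements from [28], and the function
names are hypothetical.

```python
def wall_time(t_single_node: float, speedup: float) -> float:
    """Wall-clock time of the parallel run implied by a measured speed-up."""
    return t_single_node / speedup


def relative_performance(t_reference: float, t_machine: float) -> float:
    """Relative performance: reference time over time on the benchmark machine."""
    return t_reference / t_machine


# Hypothetical times in seconds (assumptions, not measured data).
T_REFERENCE = 60.0  # time on the reference processor
machines = {
    "many slow nodes": {"t_single": 10_000.0, "speedup": 125.0},
    "few fast nodes": {"t_single": 200.0, "speedup": 3.6},
}

times = {name: wall_time(m["t_single"], m["speedup"]) for name, m in machines.items()}
for name, t in times.items():
    rel = relative_performance(T_REFERENCE, t)
    print(f"{name}: {t:.1f} s, relative performance {rel:.2f}")
```

The machine with the far higher speed-up still needs more
wall-clock time here, which is why the relative performance
is the more meaningful figure for comparing computers.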


Fig. 15 Speed-up versus processor number on Parsytec
GCel for a two-dimensional NACA 0012 airfoil
[28].

Furthermore, both quantities, as defined here, are related to
the solution of a problem of a given size employing an
increasing number of processors. Computing these values
therefore requires that the problem can be solved completely
on a single CPU of the machine. Parallelization, however, is
aimed at solving future problems that exceed today's
single processor capabilities.

Therefore, all results given here can only be taken as
general information on today's abilities of parallel
computers with respect to sequential machines. In addition,
they indicate the potential of parallel processing in CFD.

6.3  Comparison  of  Computer  Performance 
The NACA 0012 wing test case has been used for comparative
performance measurements on a number of computers of
different architecture. The results are given as histograms
in figures 16 and 17, which show the time needed relative to
that measured on a C90 single processor. As one can see in
figure 16, the workstations tested cannot compete even with
older vector computers such as the Cray Y-MP, as far as
performance is concerned. Their application is therefore
restricted to research and development duties.

Within the class of vector computers, the NEC SX-3,
which is the DLR workhorse, is clearly the strongest
machine, outperforming the reference Cray C90 processor.
On this machine a sustained performance above 1
GFLOP/s was achieved.

Switching to figure 17, one sees that older parallel
computers, i.e. the CM5 and the Intel Paragon, need many
processors, i.e. blocks, in order to reach a high
performance. For the case tested here, consisting of eight
blocks, they can only compete with single processor
workstations, since their node CPUs are too weak. Another
problem proved to be their small main memory, such that the
large test case could not be computed on them using fewer
than 32 processors. Since the results for the coarse grid
were already disappointing, this calculation was not carried
out.

Somewhat more promising are the results obtained on an
IBM SP1. On eight nodes using the fastest communication
system available, one can almost reach the performance of
older generation vector processors such as the Cray Y-MP.

The more recent distributed memory parallel computers
tested revealed that they are able to compete even with
today's vector machines. If the problem is sufficiently
large, the C90 single processor performance can almost be
reached employing 32 nodes of a NEC Cenju-3. Using the same
number of CPUs on an IBM SP2, the C90 single processor is
already outperformed.

A special case is the result of the J916, since it is a shared
memory vector computer. As one can see in this case, eight
computing nodes are already sufficient for reaching the
C90 single processor performance.


Fig. 16 Relative performance of sequential computers
obtained for the NACA 0012 wing test case
(160x32x8 cells) with respect to a Cray C90 single
processor.


Fig. 17 Relative performance of parallel computers
obtained for the NACA 0012 wing test case
(160x32x8 cells) with respect to a Cray C90 single
processor.

In figure 18 the development of the relative performance is
plotted versus the number of processors involved for the
most powerful parallel computers tested. As one can see,
the highest performance is achieved on the computers with
the most powerful single processors. On the other hand, the
more powerful the computers are, the worse their scalability
becomes, i.e. the less steep is the slope of the
corresponding curve.

This clearly indicates that on all these computers the
performance is mainly gained by an increase in the single
processor performance, whereas the network cannot keep
pace. The better scalability of the less powerful parallel
computers is therefore not gained by an improved network
speed, but by weaker processors leading to a better balance
of both components.


Fig. 18 Development of the relative performance versus
processor number with respect to a C90 single
processor.
Test case: NACA 0012 wing, 320x64x16 cells.

6.4  Influence  of  the  Communication  System 
It is clear that the performance of parallel computers is
mainly influenced by the communication system, including
hardware and software aspects. This effect was studied by
computing both test cases, the NACA 0012 wing and the
DLR-F4 wing-body combination, on an IBM SP1 using the
different communication systems available there. The
results of the measurements are given as speed-up versus
number of processors in figures 19 to 21.


Fig. 19 Speed-up versus processor number on IBM SP1.
Test case: NACA 0012 wing, coarse grid
(160x32x8 cells).


Fig. 20 Speed-up versus processor number on IBM SP1.
Test case: NACA 0012 wing, fine grid
(320x64x16 cells).

Five different communication systems were tested:

•  PVM  using  Ethernet 

•  PVM  using  the  IBM  High  Performance  Switch 

•  MPL  (POE) 

•  MPL/p  (euih)  in  default  configuration 

•  MPL/p  (euih)  with  interrupt  control. 

Figures 19 to 21 clearly indicate that PVM using an
Ethernet connection is not suited for the CFD problems
treated here, i.e. workstation clusters with an Ethernet
connection are definitely not suitable for replacing a true
parallel computer, at least for the FLOWer code.


Fig. 21 Speed-up versus processor number on IBM SP1.
Test case: DLR-F4 wing-body combination
(256x40x40 cells).

The main reason for the slowdown on eight nodes is the
low performance of the Ethernet, as can be seen from the
improvement using PVM with the High Performance
Switch. Nevertheless, there is still too much software
overhead within the communication, which is drastically
reduced by applying the IBM proprietary systems.

With  the  fastest  systems  the  algorithmically  ideal  speed-up 
is  reached  up  to  an  acceptable  degree  depending  on  the 
problem  size.  One  can  clearly  perceive  an  increase  of  the 
speed-up  when  increasing  the  work  load  per  processor,  i.  e. 
the  block  size. 

For the larger NACA 0012 wing test case even a superlinear
speed-up was obtained (figure 20), which is due to a
paging effect. Indeed, the one-block case exceeded the
main memory capacity of a single CPU, so this is a
typical case where parallel processing becomes
advantageous while speed-up measurements are questionable.
Another observation is that the algorithmically ideal
speed-up deviates considerably from the linear one, owing to
the algorithmic overhead caused by multiple computations
on block interfaces. This overhead is of course reduced
when the block size per processor is increased, as can be
seen from a comparison of both NACA 0012 wing test
cases.

On the other hand, this overhead increases markedly
when a fourth multigrid level is involved, as is done for the
DLR-F4 wing-body combination. Since W-cycles are
performed, much more time is spent on coarse grids, where
the ratio of boundary points to field points gets worse. It
seems that the FLOWer code behavior there is dominated by
the corresponding algorithmic overhead at the inter-block
boundaries and not by the increasing communication
activity, because the algorithmic ideal is reached to a high
degree, indicating an excellent parallelization efficiency.
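
How quickly the ratio of boundary points to field points
deteriorates on coarse multigrid levels can be estimated with
a short sketch. It assumes, purely for illustration, a block
that is cut only in the i-direction, one layer of interface
cells on each of its two cut faces, and coarsening by a
factor of two per level. The actual FLOWer block topology and
overlap width are not specified here and may differ; the
function name is hypothetical.

```python
def interface_to_field_ratio(ni_block: int, level: int) -> float:
    """Ratio of interface cells to field cells of one block on a multigrid level.

    Assumes two cut faces normal to i with one cell layer each, so the ratio
    reduces to 2 / (number of cells in i on that level).
    """
    ni_level = ni_block // 2**level
    if ni_level < 1:
        raise ValueError("block too small for this multigrid level")
    return 2.0 / ni_level


# Assumed example: 320 cells in i cut into 8 blocks of 40 cells each,
# evaluated on four multigrid levels (level 0 is the finest grid).
ratios_fine = [interface_to_field_ratio(40, lvl) for lvl in range(4)]

# Same cut applied to a grid with 160 cells in i (blocks of 20 cells).
ratio_coarse_grid = interface_to_field_ratio(20, 0)

print("block of 40 cells, levels 0-3:", [f"{r:.2f}" for r in ratios_fine])
print("block of 20 cells, level 0:   ", f"{ratio_coarse_grid:.2f}")
```

Under these assumptions the ratio doubles with every coarser
level, so a fourth level in a W-cycle adds work exactly where
the relative interface overhead is largest.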

6.5  Comparison  of  Communication  Models 
Finally, it is possible to compare the efficiencies of
different communication models using the FLOWer code on
shared memory computers. For this purpose, measurements
were carried out on a Cray C916 and a Cray J916 computer
using, on the one hand, the CLIC library, i.e. exploiting
coarse grain parallelism by message passing, and, on the
other hand, the auto-parallelizing compiler, which
distributes parallel data to different processors. In the
latter case the CLIC library was replaced by a dummy
library, and no additional compiler directives were used.
The message passing solutions were obtained for the
multiblock cases, whereas the auto-parallelizer worked on
the single-block problems.

The results of the measurements are given in figures 22 to
25 as speed-up versus number of processors for the two
NACA 0012 wing test cases. It can be seen that only for the
coarse grid problem does the message passing approach work
slightly worse than the auto-parallelization approach,
although it creates a considerable algorithmic overhead at
block interfaces, as pointed out above. For the fine grid
problem the parallelization via CLIC is not only
competitive on the Cray C916, but even outperforms the
auto-parallelization on the Cray J916.

Furthermore, the scalability of the loop based data parallel
approach is rather poor, which is indicated by the strongly
non-linear deviation of the speed-up from the linear one.
Using message passing, this deviation is higher for small
processor numbers, but the speed-up remains almost linear,
at least up to eight CPUs, so it is to be expected that this
approach performs better for large processor numbers.


Fig.  22  Speed-up  versus  processor  number  on  Cray  C916. 
Test  case:  NACA  0012  wing,  coarse  grid 
(160x32x8  cells). 


Fig.  23  Speed-up  versus  processor  number  on  Cray  C916. 
Test  case:  NACA  0012  wing,  fine  grid 
(320x64x16  cells). 

Fig. 24 Speed-up versus processor number on Cray J916.
Test  case:  NACA  0012  wing,  coarse  grid 
(160x32x8  cells). 


Fig.  25  Speed-up  versus  processor  number  on  Cray  J916. 
Test  case:  NACA  0012  wing,  fine  grid 
(320x64x16  cells). 


The  reasons  for  that  interesting  behavior  might  be  the  fol¬ 
lowing  [26]: 

First of all, employing the CLIC library creates some
software overhead necessary for the operations involved in
the communication. In addition, the algorithmic overhead
due to multiple computations at block boundaries further
decreases the parallel efficiency that can be obtained by
the FLOWer code. This explains the somewhat expected
behavior for small processor numbers, namely that a
vendor-specific strategy outperforms a portable one.

On the other hand, the auto-parallelizing approach is
restricted to the distribution of array data within loops to
different processors. Of course there remains a body of
operations outside of loops, i.e. scalar operations. These
are excluded from the parallelization by an
auto-parallelizing compiler, but do take part in the coarse
grain parallelization based on the block structure.
Therefore, the number of operations which cannot be
performed in parallel is higher for the data parallel
approach than for the message passing approach. According
to Amdahl's law [27]

$$S = \frac{N_p}{1 + f \, (N_p - 1)}$$

where $f$ is the portion of operations which cannot be
performed concurrently, this must lead to a higher
theoretical speed-up for the parallelization via CLIC, since
the value of $f$ is smaller there, reducing the denominator
of the above expression.
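
The effect of a smaller non-parallel portion can be
illustrated numerically. The following Python sketch
evaluates Amdahl's law for two serial fractions; the two
values of f are hypothetical, since no measured values are
given here, and the function name is an assumption.

```python
def amdahl_speedup(n_proc: int, serial_fraction: float) -> float:
    """Amdahl's law: S = N_p / (1 + f * (N_p - 1))."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    return n_proc / (1.0 + serial_fraction * (n_proc - 1))


# Hypothetical serial fractions (assumptions for illustration only).
F_MESSAGE_PASSING = 0.02  # block-based, coarse grain parallelization
F_AUTO_PARALLEL = 0.05    # loop-based parallelization, scalar code stays serial

for n in (2, 4, 8, 16):
    s_mp = amdahl_speedup(n, F_MESSAGE_PASSING)
    s_ap = amdahl_speedup(n, F_AUTO_PARALLEL)
    print(f"N_p = {n:2d}: f = 0.02 gives {s_mp:5.2f}, f = 0.05 gives {s_ap:5.2f}")
```

The gap between the two curves widens with the number of
processors, which is the behavior the argument above relies
on.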

Another reduction of the speed-up gained by
auto-parallelization is caused by small load imbalances,
which are indicated by the small wiggles of the speed-up
curves for that approach. Depending on the strategy chosen
for the distribution of concurrently processed data and on
the number of array data to be treated, it can hardly be
avoided that some processors compute slightly more data
than others, which further reduces the efficiency. In
contrast, the blocks of the test cases for the message
passing parallelization were of equal size, guaranteeing an
ideal load balancing for that strategy.
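
The origin of such wiggles can be reproduced with a simple
model. The sketch below assumes a static distribution of a
loop over the processors in contiguous chunks; the loop
length is hypothetical and the actual strategy of the
auto-parallelizing compiler is not known here.

```python
import math


def loop_speedup(n_items: int, n_proc: int) -> float:
    """Best speed-up of a loop over n_items split statically over n_proc CPUs.

    The most heavily loaded processor handles ceil(n_items / n_proc) items,
    which bounds the parallel time of the loop.
    """
    return n_items / math.ceil(n_items / n_proc)


N_ITEMS = 64  # hypothetical loop length
curve = {p: loop_speedup(N_ITEMS, p) for p in range(1, 9)}
for p, s in curve.items():
    print(f"{p} CPUs: speed-up {s:.2f}, efficiency {s / p:.2f}")
```

Whenever the loop length is not a multiple of the processor
number, the speed-up falls below the linear value, which
produces the small irregularities in the curve.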

What can be observed further is that both techniques
perform better with an increase of the problem size, which
is due to an increase of the vector length in either
approach. In the message passing solution, however, the
communication and the algorithmic overhead additionally
become less important, since the local ratio of boundary
data to field data improves. Therefore, the message passing
efficiency becomes less dependent on the processor number
when the problem size is increased while the work load per
CPU is kept constant.

Figures 22 to 25 clearly indicate that message passing
is superior to auto-parallelizing compilers for sufficiently
large blocks and for a sufficiently balanced ratio of
communication to computational power. The latter is
demonstrated by the Cray J916 results, where the single CPU
is much weaker than that of a Cray C916, but where the
message passing speed-ups are always above the
corresponding ones on the Cray C916. Therefore, the
crossover point for the message passing approach is reached
earlier than on the C916 machine.

Of course these conclusions are only valid when no
additional compiler directives are put into the code for
tuning, but this was excluded for the reasons given with the
guidelines for parallelization.

6.6 Massive Parallelism

It is clear that the benefits of parallelism will be greatest
when a large number of processors is applied, assuming a
sufficiently powerful network. There exist attempts to
employ hundreds or thousands of CPUs working on the same
problem at an overall performance of about 1 TFLOP/s.
Therefore, investigations with a two-dimensional code were
carried out in order to study the effects occurring in a
massively parallel environment [28].

Standard computations were carried out on a Parsytec
GCel for the flow field around a NACA 0012 airfoil
at M = 0.8 and α = 1.25° on an O-grid of 320 x 64 cells.
The mesh was split according to different strategies into up
to 160 blocks of equal size. The block structures varied
with respect to the direction in which the grid was split,
i.e. the mesh was subdivided in the normal direction j into
1, 2, 4 and 8 blocks while the number of blocks in the
circumferential direction i was kept constant at 1, 5, 10 or
20 blocks, respectively. The results of these computations
are shown in figure 26, where the obtained efficiencies are
plotted versus the respective number of processors.


Fig.  26  Efficiency  versus  processor  number  for  different 
block  structures  for  the  2d  NACA  0012  airfoil  on 
Parsytec  GCel. 

As one can see, the efficiency varies remarkably between
the different strategies applied. Moreover, it is essentially
determined by the number of blocks in the normal direction
j. The efficiency values for 8 blocks in j-direction vary
only between 85% and 81%, although the total number of
blocks, i.e. processors, covers a range from 8 to no less
than 160.

The reason for that behavior is found when considering
the communication pattern in the different cases. The time
needed for communicating a set of data is usually given by
the following linear relation

$$t = a + \frac{n}{b}$$

where a is the start-up time needed for initialization, b the
bandwidth and n the number of data to be transferred. The
latter usually is proportional to the number of points at a
block interface.
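
A minimal Python sketch of this cost model is given below. It
assumes the linear form t = a + n/b with b taken as a transfer
rate; the network parameters are hypothetical and do not
describe the Parsytec GCel.

```python
def comm_time(n_data: int, startup: float, bandwidth: float) -> float:
    """Time to send one message of n_data items: t = a + n / b."""
    return startup + n_data / bandwidth


# Hypothetical network parameters (assumptions, not measured values).
STARTUP = 1.0e-4    # a: start-up time per message in seconds
BANDWIDTH = 1.0e6   # b: data items transferred per second

# The same 1000 items sent as one long message or as ten short ones.
one_long = comm_time(1000, STARTUP, BANDWIDTH)
ten_short = 10 * comm_time(100, STARTUP, BANDWIDTH)

print(f"one message of 1000 items: {one_long * 1e3:.2f} ms")
print(f"ten messages of 100 items: {ten_short * 1e3:.2f} ms")
```

The start-up term penalizes many short messages, while the
second term grows with the interface length, so both the
number of neighbors and the length of the block interfaces
enter the communication cost.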

In the case considered here, the number of messages sent per
block depends only on the number of neighbors. This value
varies slightly from 2 (1 block in any direction) to 4 (4 or
8 blocks in any direction) in the worst case, but this does
not differ between a blocking in i- and in j-direction. What
matters is that the blocks resulting from a splitting in the
normal direction j always have longer edges along j than
along i in the range considered, such that the length of at
least part of the messages is always longer in j- than in
i-direction (figure 27). Therefore, for this test case it is
always profitable to achieve a given number of blocks by
slicing the grid in the circumferential direction instead of
blocking in the normal direction.

Therefore, when blocking a problem for parallelization
purposes, one should think not only of the load balancing
problem, i.e. producing equally sized subdomains, but
also of an optimum grid partitioning with respect to the
communication pattern.


Fig.  27  Schematic  block  structure  around  the  2d  NACA 
0012  airfoil. 

7.  CONCLUSIONS 

It has been shown that parallelization is an interesting
method for accelerating large CFD solvers for production
use, but for this class of programs parallelization cannot be
treated in isolation. In addition, the requirements for
portability, conservation of the effort spent on the
numerical development in sequential mode and reduction of
the parallelization effort have to be met. Therefore,
guidelines on how to proceed are given, which have been
proven to lead to a parallel code fulfilling these industrial
software objectives.

The  strategy  suggested  is  based  on  grid  partitioning  using 
message  passing  for  communication,  since  this  technique 
corresponds  to  the  well  known  multiblock  approach  in  se¬ 
quential  programs.  All  functionalities  involving  communi¬ 
cation  between  parallel  nodes  should  be  concentrated 
within  a  high  level  library  guaranteeing  portability  and 
simplifying  the  parallelization  task. 

The communications library CLIC, which is currently being
developed at the GMD within the POPINDA project, is such a
library. Based on the portable message passing interface
PARMACS, it supports any block structured program.
As an example, the parallelization of the FLOWer code, which
is developed for production use in aerodynamics, is
described. It is demonstrated that the chosen approach using
the CLIC library allows this program to run on computers
of any architecture, ranging from single processor
workstations up to shared and distributed memory parallel
machines.

Comparisons of performance data obtained with the
FLOWer code show that modern parallel computers are
already able to reach the single processor performance of a
Cray C90 processor employing a moderate number of
nodes.

Studies on different communication systems demonstrate
that the communication performance clearly determines
the potential of parallel processing. As it turns out,
workstation clusters connected by Ethernet are definitely
not suitable for replacing true parallel computers, at least
for CFD applications of the FLOWer code.

A comparison of different parallelization techniques on
shared memory computers reveals that the portable message
passing approach suggested is not necessarily inferior
to vendors' auto-parallelizing compilers. It was
demonstrated that only for small processor numbers the
FLOWer code performs worse using the CLIC library, but the
scalability features of the message passing communication
model appeared to be generally better than those of the data
parallel model involving an auto-parallelizer.

Finally,  studies  on  the  behaviour  of  different  block  struc¬ 
tures  reveal  a  strong  influence  of  the  grid  partitioning  on 
the  resulting  communication  amount  yielding  remarkable 
differences  of  the  efficiency  to  be  obtained  in  a  parallel  run. 

8.  OUTLOOK 

Further development is to be carried out in the future in
order to improve the parallel behavior of the FLOWer
code. Major effort will have to be spent on the reduction of
the algorithmic overhead at block intersections in order to
increase the absolute speed-up rates.

Furthermore  investigations  on  the  parallelization  features 
of  the  program  have  to  be  devoted  to  Navier-Stokes  com¬ 
putations,  since  up  to  now  only  Euler  results  have  been 
studied. 

Finally, the integration of a local grid refinement has to be
done within the research project POPINDA, involving both
the FLOWer code and the communications library
CLIC. Additional features of this library will be realized in
the near future, namely an automatic load balancing and a
special detection and treatment of mesh singularities.

Summary

Consideration  is  given  to  the  techniques  required  to  sup¬ 
port  adaptive  analysis  of  automatically  generated  unstruc¬ 
tured  meshes  on  distributed  memory  MIMD  parallel  com¬ 
puters.  The  key  areas  of  new  development  are  focused 
on  the  support  of  effective  parallel  computations  when 
the  structure  of  the  numerical  discretization,  the  mesh, 
is  evolving,  and  in  fact  constructed,  during  the  compu¬ 
tation.  All  the  procedures  presented  operate  in  parallel 
on already distributed mesh information. Starting from a
mesh definition in terms of a topological hierarchy,
techniques to support the distribution, redistribution and
communication among the mesh entities over the processors
are given, and algorithms to dynamically balance processor
workload based on the migration of mesh entities are
presented. A procedure to automatically generate meshes in
parallel, starting from CAD geometric models, is given.
Parallel  procedures  to  enrich  the  mesh  through  local  mesh 
modifications  are  also  given.  Finally,  the  combination  of 
these  techniques  to  produce  a  parallel  automated  finite 
element  analysis  procedure  for  rotorcraft  aerodynamics 
calculations  is  discussed  and  demonstrated. 

Nomenclature

Notation  used  to  describe  models  and  topological  entities 
within  the  models 


${}_\nu\Omega$  Domain associated with model $\nu$, $\nu = g, p$ or $m$,
where $g$ signifies the geometric model, $p$
signifies the partition model, and $m$ signifies the
mesh model.

${}_\nu\bar{\Omega}$  Closure of the domain associated with the model
$\nu$, $\nu = g, p$ or $m$.

${}_\nu T_i^d$  Topological entity $i$ from model $\nu$ of dimension
$d$; $d = 0$ is a vertex, $d = 1$ is an edge, $d = 2$ is
a face, $d = 3$ is a region.

${}^k_\nu T_i^d$  Indicates the $k$th use of the topological entity
${}_\nu T_i^d$. Use entities uniquely identify how entities
are used in non-manifold models. The simplest
case of uses arises from the fact that a face can
bound two regions. One face use is
associated with each region.

The $\pm$ indicates a directional use of the
topological entity as defined by its ordered
definition in terms of lower order entities. A $+$
indicates use in the same direction, a $-$
indicates use in the opposite direction.
$\partial({}_\nu T_i^d)$  Boundary of topological entity
${}_\nu T_i^d$, $\nu = g, p$ or $m$.

${}_\nu\bar{T}_i^d$  Closure of topological entity, defined as
$({}_\nu T_i^d \cup \partial({}_\nu T_i^d))$, $\nu = g, p$ or $m$.

$\sqsubset$  Classification symbol used to indicate the
association of one or more entities from one
model, typically $m$ or $p$, with a higher model,
typically $p$ or $g$.


Groups of topological entities used in the definition of
topological adjacencies


Unordered group of n topological entities of
dimension d

Ordered group of n topological entities of
dimension d

Cyclically ordered group of n topological
entities of dimension d

Group of n topological entities of dimension
d without order specified

The ith entity in a group of n topological
entities of dimension d


Notation  used  to  describe  adjacency  relationships  for 
topological  entities 


(")  Set  of  n  topological  entities  of  dimension  d 
adjacent  to,  or  contained  in  ip.  (p  may  be  an 
entity,  \  or  a  group  of  entities,  {vT'^) 


Examples  of  adjacency  groups 

All  model  entities  of  dimension  d  in  model  v 


v\  T'^] .  The  ith  entity  of  dimension  d  in  model  v  in 
'  the  group.  Note  that 

rpd,  r  rpdj  1  (”)  The  n  entities  of  dimension  dj  adjacent 
to  entity  ’ 

Adjacency  relationships  are  evaluated  left  to  right.  For 
example  {„r°}{„T^}  is  found  by  first  finding  the 
group  defined  by  y?  =  and  then  by  defining 

the  group 


1.  Introduction 

Adaptive  techniques  provide  the  promise  of  reliably  solv¬ 
ing  many  complex  flow  problems  to  the  desired  level  of 
accuracy.  The  computational  requirements  of  these  solu¬ 
tion  processes  can  only  be  met  by  scalable  parallel  com¬ 
puters.  The  development  of  effective  parallel  algorithms 
for  adaptive  techniques  is  challenging  due  to  the  irregular 
nature  of  adaptive  discretizations  and  the  constant  mod¬ 
ification  of  the  discretization.  These  notes  discuss  the 
techniques  required  to  support  automated  adaptive  analy¬ 
sis  on  distributed  memory  MIMD  parallel  computers. 
Three  assumptions  underlying  the  techniques  presented 
are  (i)  the  parallel  computation  algorithms  assume  a  par¬ 
titioning  of  the  mesh  onto  the  processors,  (ii)  the  meshes 
are  unstructured,  and  (iii)  the  mesh  generation  and  enrich¬ 
ment  processes  interact  directly  with  a  geometric  defini¬ 
tion  of  the  domain  being  analyzed  as  it  exists  in  a  CAD 
system.  These  assumptions  have  a  defining  influence  on 
the  procedures  developed.  The  most  critical  of  the  as¬ 
sumptions  is  the  direct  link  to  the  CAD  definition  of  the 
domain  which  allows  the  adaptive  procedures  to  solve  the 
problem  over  the  intended  domain,  not  some  approxima¬ 
tion  based  on  an  initial  mesh.  The  results  of  our  adaptive 
CFD  calculations  clearly  demonstrate  that  adaptive  results 
in  which  the  mesh  enrichments  do  not  improve  the  geo¬ 
metric  approximation  often  yield  no  improvement  in  the 
solution  accuracy.  This  is  because  the  adaptive  procedure 
is  obtaining  a  better  solution  to  the  wrong  problem. 

A key aspect of supporting calculations on an adaptively
evolving mesh is the data structure used to describe the
mesh and support its evolution during the adaptive analysis
process. When the analyses are performed on parallel
computers, capabilities must be available to support the
communications between the partitions of the mesh
assigned to the various processors. As the mesh is adapted,
the partition work load becomes unbalanced. Procedures must
therefore be available to effectively modify the mesh
partitions to regain load balance for the next
computational step. Chapter 2 of these notes presents a set
of data structures and algorithms for the effective parallel
control of evolving meshes.

The  demand  for  continuously  larger  meshes  indicates  the 
need  for  the  development  of  efficient  parallel  automatic 
mesh  generators  which  can  operate  directly  from  the  geo¬ 
metric  representations  housed  in  CAD  systems.  Chapter 
3  of  these  notes  discusses  the  issues  of  automatic  mesh 
generation  from  solid  models  and  presents  an  algorithm 
for parallel mesh generation. Although the mesh
enrichments dictated by an adaptive analysis can be
satisfied through remeshing by the automatic mesh
generator, the computational cost and the need to project
parameters between meshes make it desirable to employ
alternative mesh enrichment techniques when possible.
Chapter 4 presents a set of local mesh modification
procedures for the effective refinement and coarsening of
meshes.

Given  a  set  of  parallel  procedures  for  controlling  mesh 
partitions,  for  the  generation  and  enrichment  of  the  mesh, 
the  remaining  ingredient  of  the  automated  adaptive  analy¬ 
sis  is  the  adaptive  solver.  Consistent  with  the  other  com¬ 
ponent  procedures  presented  in  these  notes,  it  is  assumed 
that  the  solver  operates  on  an  unstructured  mesh  which 
has  been  partitioned  to  the  various  processors  of  the  par¬ 
allel  computer.  Under  this  assumption,  adaptive  finite 
volume  and  finite  element  solvers  are  most  appropriate. 
Chapter  5  presents  the  structure  of  such  a  solver.  The 
specific  solver  discussed  is  a  finite  element  based  proce¬ 
dure  which  builds  directly  on  the  parallel  mesh  control 
tools  of  the  earlier  sections. 

2.  Parallel  Control  of  Evolving  Meshes 

Central  to  the  parallel  automated  adaptive  analysis  proce¬ 
dures  considered  here  are  tools  to  control  the  mesh  and  its 
distribution  among  the  processors  as  the  meshes  are  gen¬ 
erated  and  analyzed.  These  tools  must  be  able  to  maintain 
load  balance  as  the  mesh  evolves  during  the  computations 
in such a manner that the interprocessor communications
are kept as small as possible. It is also critical that these
procedures  operate  in  parallel  and  scale  as  the  problem 
size  grows  so  they  do  not  become  the  bottleneck  in  the 
parallel  computation  process. 

The  tools  required  to  support  parallel  automated  adaptive 
analysis  include: 

1.  data  structures  and  operators  to  support  the  model 
representations  employed 

2.  interprocessor  communication  control  mechanisms 

3.  mechanisms  to  effectively  move  portions  of  the  dis¬ 
crete  models  generated  to  various  processors  so  load 
balance  can  be  maintained 

4. techniques to partition the mesh among the processors
so the load is balanced and communications are
minimized

5. techniques to update the mesh partitions to regain
load balance which was lost due to mesh modifications

The  minimum  data  structures  needed  for  an  automated 
adaptive  analysis  are  (i)  a  problem  definition,  in  terms  of 
a geometric model and analysis attributes, and (ii) a mesh,
which is the discrete representation used by the analysis
procedures. The next section describes a general structure,
based on boundary representations, for the problem
definition and the mesh. This same form of structure is
used to support the partition model used by the partition
operators, mesh migration procedures and dynamic load
balancing procedures. In addition to these data structures,
several procedures described employ tree structures
to support searching and spatial enumeration. The mesh
partition  procedures  described  in  section  2.2  are  designed 
to  effectively  collect  groups  of  mesh  entities  for  migra¬ 
tion  and,  using  the  interprocessor  communication  oper¬ 
ators,  transfer  the  information  and  update  all  local  data 
structures  as  needed. 

A  number  of  algorithms  have  been  developed  to  partition 
a  given  mesh  to  a  set  of  processors.  The  interested  reader 
is  referred  to  references  [4,  20,  21,  56,  80]  for  more 
information.  The  current  document  focuses  on  procedures 
to  update  an  existing  set  of  mesh  partitions  after  the 
mesh  has  been  modified  by  a  mesh  adaptation  procedure. 
Section  2.3  presents  two  classes  of  procedures  for  this 
purpose. 

2.1.  Mesh  Data  Structure  to  Support 
Geometry-Based  Automated  Adaptive  Analysis 

The  classic  unstructured  mesh  data  structure  of  a  set 
of  node  point  coordinates  and  element  connectivities  is 
not  sufficient  for  supporting  automated  adaptive  analysis. 
Richer  structures  are  required  to  support  adaptive  mesh 
enrichment  procedures  and  to  provide  the  links  to  the 
original  domain  definition  needed  by  critical  functions, 
including  ensuring  that  the  automatic  mesh  generator  has 
produced  a  valid  discretization  of  the  given  domain.  A 
number  of  alternative  mesh  data  structures  have  been 
proposed  for  various  forms  of  mesh  adaptation.  Instead 
of  describing  and  comparing  these  structures,  a  general 
data  structure  based  on  a  hierarchy  of  topological  entities 
is  given. 

The goal of an analysis process is to solve a set of partial
differential equations over the geometric domain of
interest, ${}_g\Omega$. Generalized numerical analysis procedures
employ a discretized version of this domain in terms of
a mesh. Since the mesh domain, ${}_m\Omega$, may not be identical
to the original geometric domain, ${}_g\Omega$, and/or various
procedures, such as automatic mesh generation, adaptive
mesh refinement and element stiffness integration, need
to understand the relationship of the mesh to the geometric
model, it is critical to employ a representational
scheme which maintains the relationships between these
two models. Although a number of schemes are possible
for defining a geometric domain [58], the most advantageous
for the current purposes are boundary-based
schemes in which the geometric domain to be analyzed
is represented as

$$ {}_g\Omega\left({}_g\{{}_gS\},\ {}_g\{{}_gT\}\right) \qquad (1) $$

where ${}_g\{{}_gS\}$ represents the information defining the
shape of the entities which define the domain and ${}_g\{{}_gT\}$
represents the topological types and adjacencies* of the
entities which define the domain. In addition to being
unique, the use of topological entities and their adjacencies
provides a convenient abstraction for defining
the relationship of different models of the same domain.
Boundary representations also allow the convenient specification,
with respect to the geometric domain, of the
analysis attributes of material properties, loads, boundary
conditions and initial conditions [72, 75]. An additional
advantage of boundary representations is the fact that current
computer aided design systems support a boundary
representation of the domains defined within them. This
allows the effective combination of these packages with
automatic mesh generation. A final advantage of recent
boundary representations is their ability to properly represent
the non-manifold geometric domains commonly
used for analysis processes [89, 32].

* In the context of a domain representation, adjacencies
are the relationships among topological entities which
bound each other. For example, the edges that bound a
face are a commonly used topological adjacency.

Since individual volume finite elements will be limited
to simple regions, bounded by simply connected faces,
consideration of the topological entities for a model can
focus on the basic 0 to d dimensional topological entities,
which for the three-dimensional case ($d = 3$) are:

$$ {}_g\{{}_gT\} = \left\{ {}_g\{{}_gT^0\},\ {}_g\{{}_gT^1\},\ {}_g\{{}_gT^2\},\ {}_g\{{}_gT^3\} \right\} \qquad (2) $$

where ${}_g\{{}_gT^d\}$, $d = 0, 1, 2, 3$, are respectively the sets of
vertices, edges, faces and regions defining the primary
topological elements of the domain†.

Critical  to  the  understanding  of  the  relationship  of  the 
mesh  with  the  geometric  domain  is  the  concept  of  classi¬ 
fication  of  a  derived  model  to  its  parent  model  [66,  67]. 
Definition: Mesh Classification Against the Geometric
Domain. The unique association of a topological mesh
entity of dimension $d_i$, ${}_mT_i^{d_i}$, to a topological geometric
domain entity of dimension $d_j$, ${}_gT_j^{d_j}$, where $d_i \le d_j$, is
termed classification and is denoted

$$ {}_mT_i^{d_i} \sqsubset {}_gT_j^{d_j} \qquad (3) $$

where the classification symbol, $\sqsubset$, indicates that the left
hand entity, or set, is classified on the right hand entity.
Multiple ${}_mT_i^{d_i}$ can be classified on a ${}_gT_j^{d_j}$. A mesh
region, ${}_mT_i^3$, is classified in the domain region, ${}_gT_j^3$, in
which it lies. A mesh face, ${}_mT_i^2$, is classified in the
domain region, ${}_gT_j^3$, in which it lies, or on the domain
face, ${}_gT_j^2$, on which it lies. A mesh edge, ${}_mT_i^1$, is
classified in the domain region, ${}_gT_j^3$, in which it lies, on
the domain face, ${}_gT_j^2$, on which it lies, or on the domain
edge, ${}_gT_j^1$, on which it lies. Finally, a mesh vertex, ${}_mT_i^0$,
is classified in the domain region, ${}_gT_j^3$, in which it lies,
on the domain face, ${}_gT_j^2$, on which it lies, on the domain
edge, ${}_gT_j^1$, on which it lies, or on the domain vertex,
${}_gT_j^0$, on which it lies. Mesh entities are always classified with
respect to the lowest order object entity possible.

† Proper consideration of general geometric domains requires
consideration of the loop and shell topological
entities, and, in the case of non-manifold models, use
entities for the vertex, edge, loop, and face entities [89].
We will introduce any of these additional entities only
as needed.

Any automated adaptive analysis must consider both the
geometric domain representation, ${}_g\Omega({}_g\{{}_gS\}, {}_g\{{}_gT\})$,
and the mesh representation, ${}_m\Omega({}_m\{{}_mS\}, {}_m\{{}_mT\})$,
where ${}_m\{{}_mS\}$ is limited to pointwise information at
specific  locations  obtained  by  interrogation  of  the  geo¬ 
metric  model  representation.  Since  the  mesh  representa¬ 
tion  lacks  the  complete  geometric  shape  information  of 
the  geometric  domain  representation,  that  shape  infor¬ 
mation  must  be  accessed  during  various  operations  such 
as  integrating  elements  to  the  true  geometry,  or  placing 
new  nodes  defined  by  adaptive  refinement  on  the  true 
boundary  of  the  domain. 

Classification  of  the  mesh  against  the  geometric  domain  is 
central  to  (i)  ensuring  that  the  automatic  mesh  generator 
has  created  a  valid  mesh  [66,  67],  (ii)  transferring  analysis 
attribute  information  to  the  mesh  [75],  (iii)  supporting  h- 
type  mesh  enrichments,  and  (iv)  integrating  to  the  exact 
geometry  as  needed  by  high  order  elements. 

In  addition  to  the  mesh  representation,  it  is  often  desirable 
to  consider  other  derived  representations  of  the  domain. 
The one of central importance to the parallel adaptive
analysis is the processor representation, ${}_p\Omega$. This representation
is an intermediate representation between that of
the  mesh  and  the  geometric  domain.  Therefore,  its  topo¬ 
logical  entities  can  be  classified  against  the  geometric  do¬ 
main.  Since  the  mesh  is  the  lowest  order  representation, 
its  entities  can  be  classified  against  both  the  geometric 
domain  and  the  processor  representation. 

An  additional  representation  employed  in  the  parallel 
mesh  generation  procedure,  and  one  set  of  parallel  adap¬ 
tive  procedures,  is  an  octree  representation.  Since  tree 
representations  are  derived  to  support  specific  searching 
operations,  or  spatial  enumerations,  they  vary  dramati¬ 
cally  from  the  topological  hierarchies  used  to  define  the 
geometric  domain  and  mesh.  Structures  of  these  types 
will be described as they are used in specific algorithms.

The adjacencies of various order mesh topological entities
and their classification with respect to the higher
order  models  are  used  to  support  a  great  number  of  the 
operations  required  by  a  parallel  automated  adaptive  anal¬ 
ysis.  Therefore,  it  is  important  that  they  can  be  quickly 
determined.  Clearly,  if  the  adjacencies  of  each  order 
entity  against  all  other  entities  were  stored,  all  possible 
adjacency  information  would  be  readily  available.  This 
approach  would  be  highly  wasteful  with  respect  to  the 
amount  of  data  storage  required.  On  the  other  hand,  stor¬ 
ing  only  a  minimal  number  of  adjacencies  could  require 
extensive  searches  and  sorts  to  determine  other  specific 
adjacencies.  An  examination  of  the  specific  adjacencies 
used  by  the  various  algorithmic  operations  provides  guid¬ 
ance  as  to  the  minimum  number  of  adjacencies  needed. 
For example, references [6, 13, 30, 38] define adjacencies
used in specific finite volume and finite element procedures.
Since the procedures considered here must support
automatic mesh generation procedures, which are highly
demanding from the viewpoint of topological adjacencies,
and any form of adaptive analysis on conforming unstructured
meshes‡, all adjacencies are either stored, or
can be quickly determined through a set of local traversals
and sorts which are not a function of the mesh size.
One  set  of  relationships  that  can  effectively  meet  these 
requirements  is  to  maintain  adjacencies  between  entities 
one  order  apart.  Figure  1  graphically  depicts  this  set  of 
relationships  as  well  as  the  classification  with  respect  to 
the  geometric  domain  representation. 


Figure  1.  Mesh  topological  adjacencies 
and  classification  information 
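
The one-order-apart adjacency storage and the classification link can be pictured with a small data structure. The following Python sketch is illustrative only: the class `Entity`, its field names, the helper `make_entity` and the `(dimension, tag)` form of the classification are assumptions made for this example, not the interface of the mesh database described in these notes.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(eq=False, repr=False)
class Entity:
    dim: int                                   # 0 vertex, 1 edge, 2 face, 3 region
    down: list = field(default_factory=list)   # bounding entities of dimension dim - 1
    up: list = field(default_factory=list)     # entities of dimension dim + 1 bounded by this one
    classification: Optional[Tuple[int, int]] = None  # assumed (dimension, tag) of a model entity


def make_entity(dim, down=(), classification=None):
    """Create an entity and register it in the upward lists of its boundary."""
    entity = Entity(dim, list(down), [], classification)
    for lower in entity.down:
        lower.up.append(entity)
    return entity


# One triangular mesh face classified on model face 7 (the tag is made up).
vertices = [make_entity(0) for _ in range(3)]
edges = [make_entity(1, (vertices[i], vertices[(i + 1) % 3])) for i in range(3)]
face = make_entity(2, edges, classification=(2, 7))

print(len(face.down), len(vertices[0].up), len(edges[0].up))  # prints: 3 2 1
```

Each entity holds only its neighbors one dimension down and one dimension up, which is the storage pattern depicted in Figure 1.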


Since there are natural orderings for several of the adjacencies
which prove useful to the operations performed,
three forms of adjacency are employed, each with its own
notation: an unordered list of $n$ entities adjacent to an
entity ${}_mT_i^{d_i}$, a linear list of $n$ entities adjacent to
${}_mT_i^{d_i}$, and a cyclic list of $n$ entities adjacent to
${}_mT_i^{d_i}$. Specific entities also
store directional knowledge of how that entity is used in
the specific adjacency. In these cases the left superscript,
$\pm$, on the entity, ${}^{\pm}_{m}T_i^{d}$, indicates a directional use of the
topological entity ${}_mT_i^{d}$ as defined by its ordered definition
in terms of lower order entities. A $+$ indicates use
in the same direction, while a $-$ indicates a use in the
opposite direction.

The  specific  downward  adjacencies  stored  are: 

For  mesh  regions 

(4) 

which  indicates  the  faces  bounding  the  mesh  region, 
where  n  =  4  for  a  tetrahedron,  n  =  6  for  a  hexahe¬ 
dron,  etc. 

‡ A conforming mesh is one where all mesh entities exactly
match. For example, a situation where the mesh
edge bounding one mesh face has two mesh edges from
another mesh face lying exactly on top of it is not allowed.
Although it is possible to extend the procedures presented
here to support those situations, they will not be
considered in the present document.


For  mesh  faces 


(5)


which defines the loop of edges that bound the face, where
n = 3 for a triangular face and n = 4 for a quadrilateral
face.

For  mesh  edges 

(6) 

which indicates the two vertices that bound the edge.

The specific upward adjacencies stored are:

For  mesh  vertices 

(7) 

which indicates the edges that the vertex bounds.

For  mesh  edges 

(8) 

which  indicates  the  faces  the  edge  partly  bounds. 

For  mesh  faces 

(9) 

which  indicates  the  zero,  one,  or  two  regions  the  face 
partly  bounds. 
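
With only these one-order-apart adjacencies stored, any other adjacency follows from a short chain of local traversals. A minimal Python sketch of that idea, assuming (for illustration only) that entities are string labels and that the stored adjacencies are held in dictionaries named `down` and `up`:

```python
from itertools import combinations

# Downward adjacencies of a single tetrahedron "r": region -> faces -> edges -> vertices.
down = {}
for a, b in combinations(range(4), 2):
    down[f"e{a}{b}"] = [f"v{a}", f"v{b}"]
for a, b, c in combinations(range(4), 3):
    down[f"f{a}{b}{c}"] = [f"e{a}{b}", f"e{a}{c}", f"e{b}{c}"]
down["r"] = [f"f{a}{b}{c}" for a, b, c in combinations(range(4), 3)]

# Upward adjacencies are the inverse of the downward ones.
up = {}
for entity, boundary in down.items():
    for lower in boundary:
        up.setdefault(lower, []).append(entity)


def climb(start, levels):
    """Follow `levels` stored upward adjacencies from `start`, removing duplicates."""
    current = {start}
    for _ in range(levels):
        current = {upper for entity in current for upper in up.get(entity, ())}
    return current


print(sorted(climb("v0", 1)))  # edges using v0
print(sorted(climb("v0", 2)))  # faces using v0
print(sorted(climb("v0", 3)))  # regions using v0
```

The work in `climb` depends only on the number of entities around the starting vertex, not on the size of the mesh.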

An  alternative  set  of  adjacencies  which  can  directly  meet 
the  needs  of  many  applications  is  to  maintain  the  same 
downward  adjacencies  and  store  only  the  single  upward 
adjacency  from  the  vertices  to  the  highest  order  entities 
using  them.  In  the  case  of  a  manifold  mesh  in  3-D  this 
upward  adjacency  would  be 

(10) 

which are the regions that the vertex bounds. In the
case of general non-manifold models, it is the upward
adjacencies from the vertices to any mesh entity they bound
which itself does not bound a higher order entity. In
this case the adjacency relationship is a bit more complex,
being

$$ {}_mT_i^0 \left\{ {}_mT^3,\ \ {}_mT^2 \,\Big|\, \left|{}_mT^2\{{}_mT^3\}\right| = 0,\ \ {}_mT^1 \,\Big|\, \left|{}_mT^1\{{}_mT^2\}\right| = 0 \right\} \qquad (11) $$

This set includes the regions the vertex bounds, the faces
the vertex bounds which do not bound any regions, and
the edges the vertex bounds which do not bound any faces.
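
The set described for the non-manifold case can be gathered by climbing the stored upward adjacencies and reporting every entity that has nothing above it. The sketch below is a hypothetical Python illustration; the dictionary `up` of upward adjacencies, the function name and the entity labels are assumptions made for the example.

```python
def highest_order_users(up, vertex):
    """Entities bounded by `vertex` that do not themselves bound a higher order entity."""
    result = set()
    frontier = set(up.get(vertex, ()))
    while frontier:
        next_frontier = set()
        for entity in frontier:
            users = up.get(entity, ())
            if users:
                next_frontier.update(users)   # keep climbing
            else:
                result.add(entity)            # nothing above it: report it
        frontier = next_frontier
    return result


# Non-manifold neighborhood of vertex "v": a wire edge e1, a dangling face f1,
# and a face f2 that bounds region r1.
up = {
    "v": ["e1", "e2", "e3", "e4"],
    "e2": ["f1"], "e3": ["f1", "f2"], "e4": ["f2"],
    "f2": ["r1"],
}
print(sorted(highest_order_users(up, "v")))  # prints: ['e1', 'f1', 'r1']
```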

2.2.  Partition  Communication 
and  Mesh  Migration 


Adaptive  unstructured  meshes  on  distributed  memory 
computers  require  data  structures  which  provide  efficient 
queries  for  various  entity  and  processor  adjacency  infor¬ 
mation  as  well  as  fast  updates  for  changes  in  the  mesh. 
The  requirements  for  sequential  implementations  of  hp- 
adaptive  finite  element  methods  can  be  satisfied  by  the 
SCOREC  mesh  database  just  given.  For  parallel  appli¬ 
cations,  we  first  enumerate  the  major  requirements  of  a 
distributed  memory  mesh  environment.  These  require¬ 
ments  are  met  by  the  distributed  mesh  environment  Par¬ 
allel  Mesh  Database  (PMDB)  that  is  then  described. 
2.2.1 Requirements of PMDB and Related Efforts

A  parallel  mesh  database  must: 

•  Provide  a  common  interface  and  a  single  library 
for  all  the  mesh  related  applications,  namely,  mesh 
generation,  mesh  refinement/coarsening  and  finite 
element  analysis. 

• Provide a full spectrum of adjacency relations among
shared  entities  on  different  processors. 

• Provide a general purpose mesh migration algorithm
which will facilitate arbitrary mobility of mesh entities
among processors. Additionally, the procedures
that update the data structures after migration should
be scalable.

•  Support  meshes  generated  on  non-manifold  models. 
In  a  non-manifold  representation  the  surface  area 
around  a  given  point  on  a  surface  might  not  be 
flat  in  the  sense  that  the  neighborhood  of  the  point 
need  not  be  a  simple  two-dimensional  disk  [89]. 
Figure  2  shows  examples  of  meshes  on  non-manifold 
geometric models. Like the mesh data structures,
the PMDB can handle the situations in which mesh
entities  attach  to  vertex  contacts.  This  specifically 
requires  the  ability  for  such  entities  to  be  migrated 
with  no  loss  of  information,  and  that  the  vertex  at 
the  contact  can  be  a  shared  partition  boundary  entity. 


Figure  2.  Example  meshes  handled  by  PMDB  library 


The  early  parallel  and  distributed  memory  implementa¬ 
tions  of  finite  element  methods  such  as  [51]  involved 
static  meshes  and  used  the  data  parallel  SIMD  computing 
systems  such  as  CM2.  The  ease  of  programming  static 
and  regular  problems  using  the  data  parallel  model  led  the 
compiler  writers  to  incorporate  this  model  in  high  perfor¬ 
mance  Fortran  compilers.  The  analysis  for  generating 
communication  primitives  for  irregular  references  found 
in  unstructured  meshes  could  not  be  done  at  compile  time. 
Therefore,  runtime  systems  such  as  the  PARTI  primitives 
[63]  were  designed  which  would  compute  these  refer¬ 
ences  prior  to  entering  a  loop  where  the  actual  computa¬ 
tions  are  done.  If  the  distribution  of  the  mesh  changes, 
then  all  the  references  have  to  be  recomputed.  Since 
limited  analysis  can  be  done  at  the  level  of  references 
only,  the  data  parallel  Fortran  compilers  soon  proved  to 
be  weak  for  handling  the  dynamically  changing  mesh  data 
structures of adaptive applications. This weakness has directed
other researchers to design distributed mesh environments
providing functionalities for refinement, coarsening,
migration and load balancing.

A heuristic which has been the by-product of high performance
Fortran compilers is the owner computes paradigm
[11, 95]. This heuristic was used as a rule for letting the
processor which owns a data item perform the computations
which define it. This paradigm is also used in
other contexts such as the parallel linear solvers provided by
PETSc [31], which require the designation of owners of
the rows. A variation of this paradigm is used in implementing
the current mesh migration algorithm.

Williams' Distributed Irregular Mesh Environment
(DIME) project [90] can be considered one of the earliest
distributed unstructured mesh environments. This
initial version was restricted to two dimensional meshes
and could not handle non-manifold models and surface
meshes such as a torus. The newer version, DIME++
[93], implemented in C++, provides support for three
dimensional elements.

DIME  uses  a  hash  table  to  implement  voxel  databases 
[92]  which  store  a  global  key  associated  with  an  entity. 
This key is the geometric centroid of the entity. The coordinates
of the centroid are converted to an integer hash
table index by dividing them by a user supplied tolerance.
We show in sections 2.2.2 and 2.2.3 that explicit generation
of a global key by computing and storing the centroid
is not necessary. When elements are migrated in DIME,
new  voxel  entries  are  packaged  into  a  message  and  the 
message  is  passed  from  processor  to  processor  in  a  ring 
until  each  has  seen  the  message.  Each  processor  takes 
the  voxel  entry  and  checks  if  a  match  is  found  in  the 
hash  table.  If  found,  then  this  implies  that  the  entity  is 
shared  and  the  off-processor  address  is  stored.  Note  that 
Williams  uses  the  notion  of  secretary  points  which  cor¬ 
respond  to  the  owner  of  shared  entities  in  PMDB.  Even 
though  the  secretary  points  are  used  in  computing  the 
scalar  products,  they  are  not  utilized  in  the  implemen¬ 
tation  of  an  efficient  update  procedure  after  migration. 
Since  the  new  voxels  are  passed  in  a  ring  of  all  proces¬ 
sors,  the  update  procedure  has  a  fixed  cost  dependent  on 
the  number  of  processors. 
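
The centroid-based key described above can be sketched as follows. This is a Python illustration of the idea only, not DIME code: the function name `voxel_key` and the use of rounding to obtain integers are assumptions (the text states only that the centroid coordinates are divided by a user supplied tolerance).

```python
def voxel_key(points, tolerance):
    """Integer key built from the centroid of an entity's vertex coordinates."""
    count = len(points)
    centroid = [sum(p[axis] for p in points) / count for axis in range(3)]
    # Assumed conversion: divide by the tolerance and round to the nearest integer.
    return tuple(round(c / tolerance) for c in centroid)


# Two processors holding copies of the same edge compute the same key.
edge_on_p0 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
edge_on_p1 = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]  # same edge, vertices listed in reverse
table = {voxel_key(edge_on_p0, 0.25): ("p0", "local address of the edge on p0")}
print(voxel_key(edge_on_p1, 0.25) in table)  # prints: True
```

A match in the table is what identifies an entity as shared. The approach discussed in the following sections avoids computing and storing such a geometric key.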

Vidwans  et  al.  [85]  present  a  procedure  to  migrate 
tetrahedral  elements  between  face  adjacent  and  sender- 
receiver-disjoint  processors.  The  sender-receiver-disjoint 
requirement  necessitates  processors  involved  in  migration 
to  be  paired  as  either  a  sender  or  a  receiver.  This  pairing 
process  is  carried  out  as  part  of  their  divide  and  conquer 
dynamic  load  balancing  algorithm.  Since  a  face  can  be 
shared  by  no  more  than  two  processors  and  a  processor 
migrates  to  its  face-adjacent  processor,  the  shared  face 
identification is readily available. Hence Vidwans et
al. do not need to use global identification numbers. A
disadvantage  of  sender-receiver  disjoint  migration  is  that 
elements  cannot  be  piped  by  a  receiver  processor  to  other 
processors  in  the  same  cycle  of  migration.  This  can 
lead  to  memory  problems  whereby  a  receiving  processor 
obtains  a  large  number  of  elements  and  has  to  store  them 
before  it  can  pass  them  onto  other  processors. 
The Tiling system developed by Devine [18] is the first
distributed environment to support hp-adaptive analysis
and provides migration routines for regularly structured
two dimensional meshes which can be hierarchically refined.
Each tiling element stores pointers to its four
neighboring elements, with partition boundary elements pointing
to  a  ghost-element  data  which  acts  as  a  buffer  during 
communication.  The  elements  are  assigned  a  unique  id 
at  the  beginning  and  after  refinements.  The  elements  with 
unique  ids  are  maintained  in  a  balanced  AVL  tree  [68]  to 
allow  efficient  insertion  and  deletion  during  migration. 
The  Tiling  system  supports  only  rectangular  elements 
as  the  basic  entity  and  the  notion  of  shared  entities  like 
edges  is  implicit. 

2.2.2  Distributed  Mesh  Model  and  Notation  Used 

The distributed mesh is viewed analogously to the modeling
of non-manifold geometric objects. Figure 3 shows
the hierarchical classification of the global mesh entities
${}_mT_i^d$, the processor model entities ${}_pT_i^d$ and geometric
model entities ${}_gT_i^d$. Given the set of mesh entities
$\{{}_mT\}$, a partitioning at the $d_m$ dimension level divides
the mesh into $n_p$ parts, ${}_pT_p^{d_m}$, each of which is assigned
to a processor with id $p = 0, \ldots, n_p - 1$. As a result
of partitioning, some of the entities with dimension
$d < d_m$ will be shared by more than one processor. A
$d_m$-dimensional entity will be held by only one processor.
Hence in general, partitioning with $d_m > 0$ defines a
one-to-many relation from a mesh entity ${}_mT_i^d$ to its $k$ uses,
where $k \le \min(\Delta({}_mT_i^d), n_p)$. Here $\Delta$ defines the
degree of an entity, i.e. given the dimension $d$ of an entity,
$\Delta$ is the number of $d + 1$ dimensional entities which
use it.

Since the procedures in a distributed memory environment
operate on private local processor address space,
we refer to each entity use in the global model as
${}_{(p_k,a_k)}T^d$ or, in shorthand notation, $(p_k, a_k)$. The tuple
$(p_k, a_k)$ stands for the use of an entity by processor $p_k$ at
local address $a_k$. In the algorithm descriptions presented
later this tuple is also called a link, particularly if it is
stored on a different processor than $p_k$.

For the implementation of the owner computes paradigm,
one of the processors holding a given entity ${}_mT_i^d$ is
designated as the owner of that entity. In the distributed
processor address space, we distinguish the owned entities
as $(p_o, a_o)$. Therefore, a partitioning in this case defines
a one-to-one and onto mapping of global mesh entities
onto the owned distributed mesh entities, ${}_mT_i^d \mapsto (p_o, a_o)$.
Note that the inverse of this mapping exists and hence the
pair $(p_o, a_o)$ can serve as a global key of a distributed entity.

The uses of the shared entities are mapped onto the owner
entity by a many-to-one relation $\Phi$:

$$ (p_k, a_k) \ \overset{\Phi}{\longmapsto}\ (p_o, a_o) \qquad (12) $$

Figure 3 shows the relationship between the geometric
model entities ${}_gT_i^d$, the global mesh entities ${}_mT_i^d$ and
the processor model. Given the uses $(p_k, a_k)$ of an
entity distributed over processors $p_k$, an agreement can
be reached among these processors on whether they hold
the identical entity by computing the ownership using the
function $\Phi$.


Figure 3. The relationship between the mesh
model, processor model and the geometric model
(panels: processor 0, processor 1, processor 2, processor 3)

2.2.3  Data  Structures 

PMDB data structures were designed to provide a full variety
of adjacency information. At the micro level of a partition
boundary  entity,  one  should  be  able  to  get  all  the  uses 
or  links  of  an  entity  on  other  processors.  Each  partition 
boundary  entity  stores  all  the  uses  on  other  processors  as  a 
linked  list.  This  is  shown  in  Figure  4.  Note  that  one  of  the 
processors  holding  a  shared  entity  is  marked  as  an  owner 
of  that  entity.  The  bold  edges  and  vertices  indicate  the 
owners  of  the  shared  entities.  This  ownership  information 
can  be  used  in  the  implementation  of  the  owner  computes 
rule,  for  example,  during  link  updates  in  mesh  migration 
or  scalar  product  computation  in  an  iterative  linear  solver. 


Figure 4. PMDB interprocessor
links and entity ownership (panels: processor 0, processor 1)

Since each processor stores the uses $(p_k, a_k)$ on all the
processors that hold a shared entity, the ownership can
be computed as a function of these uses. An example
of an ownership function $\Phi$ given in equation 12 is
to choose the processor which has the minimum tuple
$(p_k, a_k)$. The other alternative is to let the
owner regenerate the ownership. Whereas the former
method can be done locally, the latter method needs
communication of ownership information from the owner
to the holders.
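
The locally computable choice of owner can be illustrated in a few lines of Python. This is a sketch under stated assumptions: uses are represented as plain `(processor, address)` tuples compared lexicographically, the addresses are made-up integers, and the function name `owner` is hypothetical.

```python
def owner(uses):
    """Ownership function: the use with the smallest (processor, address) tuple."""
    return min(uses)


# Uses (p_k, a_k) of one shared vertex; the addresses are made-up integers.
uses = [(3, 4096), (1, 8200), (2, 512)]

# Every holder knows the full list of uses, so each one evaluates the rule
# locally and all of them arrive at the same owner without communicating.
owners_seen = {p: owner(uses) for p, _ in uses}
print(owners_seen)  # prints: {3: (1, 8200), 1: (1, 8200), 2: (1, 8200)}
```

Because the result depends only on the set of uses, the order in which a processor happens to store its links does not matter.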

Note that the ownership information provides a global
key for identifying an entity uniquely over all processors.
Since the pair $(p_o, a_o)$ is the global key, there is no need
to generate and store a separate key as Williams [92] does
by computing the centroid of the entity.

On a processor, at the level of entities, the sets of entities
that are on the partition boundary or adjacent to a specific
processor are organized in doubly linked lists which provide
constant-time insertion and deletion. Figure 5(a) shows the
organization of the partition boundary entities. The lists can be
traversed to get partition boundary entities shared among
processors. For example, the set of all partition boundary
vertices, given by

$$ V = \left\{ {}_mT_i^0 : \exists\, {}_pT_j^d \ \text{s.t.}\ {}_mT_i^0 \sqsubset {}_pT_j^d \ \wedge\ 0 \le d \le 1 \ \wedge\ \left|{}_pT_j^d\{{}_pT^2\}\right| > 1 \right\} \qquad (13) $$

can be enumerated by the data structure. The set of
all partition boundary edges $E$, which can be similarly
defined, is also readily available.


Figure 5. Doubly linked structures of
partition boundary entities: global view
(a) and partition boundary entity view (b)

In addition, adjacent processors based on various entity
connectivity as well as the number of entities adjacent to
the processor are maintained by storing this information in
a linked list. Figure 5(b) shows the structure of the vertex
adjacent processors and the doubly linked lists attached
to it.

The  list  of  partition  boundary  vertices  adjacent  to  a 
particular  processor  pk  can  be  given  by: 


which  is  directly  accessible  from  the  data  structures. 
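
A compact way to picture these lists is sketched below in Python. It is a stand-in built on assumptions, not the PMDB implementation: hash-based sets and dictionaries replace the doubly linked lists (both give constant-time insertion and deletion, on average in the hashed case), and the class and method names are hypothetical.

```python
from collections import defaultdict


class PartitionBoundary:
    """Partition boundary entities of one processor and their remote uses."""

    def __init__(self):
        self.links = defaultdict(set)     # entity -> {(processor, address), ...}
        self.by_proc = defaultdict(set)   # processor -> entities shared with it

    def add_link(self, entity, proc, address):
        self.links[entity].add((proc, address))
        self.by_proc[proc].add(entity)

    def remove_link(self, entity, proc, address):
        self.links[entity].discard((proc, address))
        if not any(p == proc for p, _ in self.links[entity]):
            self.by_proc[proc].discard(entity)
            if not self.by_proc[proc]:
                del self.by_proc[proc]
        if not self.links[entity]:
            del self.links[entity]        # no longer on the partition boundary

    def adjacent_to(self, proc):
        """Boundary entities shared with processor `proc`."""
        return set(self.by_proc.get(proc, ()))

    def adjacency_counts(self):
        """Adjacent processors and the number of entities shared with each."""
        return {proc: len(entities) for proc, entities in self.by_proc.items()}


pb = PartitionBoundary()
pb.add_link("v7", 1, 0x10)
pb.add_link("v7", 3, 0x2C)
pb.add_link("v9", 1, 0x18)
print(sorted(pb.adjacent_to(1)), pb.adjacency_counts())
```

The per-processor index is what makes the list of boundary vertices adjacent to a particular processor available without a search over the whole partition boundary.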

2.2.4  Mesh  Migration 

Analogous to the owner computes rule, the mesh migration
procedure of PMDB uses an owner updates rule to
collect and update any changes to the links on partition
boundaries after moving entities among processors. The
migration of a set of mesh entities from a given processor
to destination processors proceeds in three stages. First,
sender processors migrate the mesh entities to receiver
processors. Second, the sender and receiver processors
report the deletions or new addresses of migrated
mesh entities to owner processors. In the last stage, the
owner processors inform the affected processors about the
updates in links. The processing which is done in the first
stage is proportional to the number of mesh entities being
migrated, whereas in the second and third stages, it is
proportional to the size of the union of the boundaries of
the migrated mesh entities. The migration procedure is given
in Figure 6 and the detailed steps of the algorithm are
explained below.

Senders to Receivers: These steps are responsible for
sending the raw mesh data from the sender to the receiver
processors. The mesh entities in $\{{}_mT_{s_i}\}$ to be sent are
packed into messages together with the data attached to
the entities. The entities on the union of the boundary
of the migrated mesh entities are also found, since any
possible link updates will be limited to these.

In the case of an all-tetrahedral mesh in 3D space, the migrated
boundary is given by the faces which have exactly
one migrated region targeted to the same processor attached
to them. This applies to two dimensional meshes also,
with the migrated boundary enclosed by edges having
exactly one face on their side which is being migrated to
the same processor. Finding the migrated boundary for
three dimensional meshes which contain both tetrahedra
and dangling faces, as shown in Figure 2(b), requires additional
work. In this case, if dangling faces are being
migrated then the migrated boundary cannot be derived by
just checking the edges in the manner that is done for 2D
meshes. Additionally, the vertices must be checked to see
if they are used by any edge which is not being migrated.

The migrated internal entities can be deleted immediately
since they cannot be referred to again by any processor.
The migrated boundary entities cannot be deleted immediately,
since if they happen to be owned by the processor,
they will act as a fixed point where all the shared entity
uses will be collected later.
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procedure  mesh_migrate  }^Pr,{mTr;  }) 

input:  Ps'.  destination  processors. 

{mPfi  }  •  sets  of  regions  to  be  migrated 
output:  Pr’.  source  processors 

sets  of  regions  received 

begin 

/*  1.  senders  and  receivers  to  owners  */ 

1  Pack  the  mesh  }  to  be  sent. 

2  Find  the  migrated  boundary. 

3  Delete  migrated  internal  entities 

4  Pack  the  owners’  uses  corresponding  to  migrated 
boundary 

5  Send  packed  submeshes  and  uses  to  Ps^ 

6  Receive  packed  submeshes  and  uses  from  p., 

7  Unpack  the  submeshes  to  get  } 

/*  2.  senders  and  receivers  to  owners  */ 

8  Establish  usage  of  both  sent  and  received  migrated 
boundary  entities. 

9  Pack  local  uses  of  migrated  boundary  and  owners 
uses  to  be  sent  to  owner  processors  P, 

10  Send  packed  local  and  owner  uses  to  owner  proces¬ 
sors. 

11  Receive  packed  uses  from  senders  and  receivers. 

/*  3  owners  to  affected  */ 

12  Owners  update  use  lists  by  inserting/deleting  re¬ 
ceived  local  uses  into/from  use  lists  pointed  to  by 
owner  uses  and  generate  new  ownerships. 

13  Pack  updated  uses  list  of  entities  to  be  sent  to  af¬ 
fected  processors  Pa. 

14  Send  updated  use  lists  and  ownership  to  owner  pro¬ 
cessors. 

15  Receive  updated  uses  list  and  ownership  from  owner 
processors. 

16  Pa  update  use  lists  and  ownership. 

17  Delete  unused  sent  migrated  boundary  entities. 

end 

Figure  6.  Mesh  Migration  Algorithm 
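The migrated boundary rule for all-tetrahedral meshes stated above (a face is on the migrated boundary when exactly one attached region is migrated to a given processor) can be sketched as follows. This is only an illustration, not the PMDB implementation: the dictionaries `region_faces` and `destination` and the function name are hypothetical, and dangling faces are not handled.

```python
from collections import defaultdict

def migrated_boundary_faces(region_faces, destination):
    """Return {dest: set of faces} bounding the regions sent to each dest.

    region_faces: region id -> iterable of face ids (4 per tetrahedron)
    destination:  region id -> target processor, for migrated regions only
    (both containers are hypothetical stand-ins for the mesh database)
    """
    # (face, dest) -> number of attached regions migrated to dest
    count = defaultdict(int)
    for region, dest in destination.items():
        for face in region_faces[region]:
            count[(face, dest)] += 1
    boundary = defaultdict(set)
    for (face, dest), n in count.items():
        if n == 1:  # exactly one migrated region on this face for this target
            boundary[dest].add(face)
    return dict(boundary)

# Two tetrahedra A and B share face "fs"; C stays on the processor.
faces = {
    "A": ["a1", "a2", "a3", "fs"],
    "B": ["b1", "b2", "b3", "fs"],
    "C": ["c1", "c2", "c3", "b1"],
}
same_target = migrated_boundary_faces(faces, {"A": 1, "B": 1})
split_target = migrated_boundary_faces(faces, {"A": 1, "B": 2})
```

When A and B go to the same processor the shared face is interior to the migrated set; when they go to different processors it appears on both migrated boundaries.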

Once the packed submesh has been received, the processors
unpack it and insert it into the mesh held by the processor
$p_k$. It is also possible that when more than one submesh
arrives from different processors, they all might share some
common entities. Figure 7 shows an example of such a case.
As shown, processors 0 and 2 both migrate to processor 1.
Among the migrated entities are those which are shared by
both 0 and 2. In such a case, these commonly shared
entities, once unpacked, should not be unpacked again for
subsequently received submeshes which also contain them and
come from a different processor. This is achieved by
inserting the unpacked migrated boundary entities into a
red-black tree [68], which has guaranteed logarithmic access
for each inserted entity. A key is needed to represent the
entity in the red-black tree. This key can be either a
global key or the readily available $(p_o, a_o)$ tuple which
was discussed earlier. Currently, PMDB version 3.1 by
default generates global numbers after the mesh is refined.
The global numbers can be used for debugging and also
provide a readily available equation number for linear
equation solvers which assemble the global matrix. A future
version of PMDB will make the global number generation
optional in order to save memory for applications which do
not need it.
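A minimal sketch of the duplicate filtering during unpacking is given below. It is an illustration under stated assumptions: a Python dict (a hash map) is swapped in for the red-black tree named in the text, the key is taken to be the (owner processor, owner address) tuple, and the message layout and function name are hypothetical.

```python
def unpack_submeshes(local_mesh, received):
    """Insert received boundary entities once, keyed by (owner proc, owner addr).

    local_mesh: dict mapping key -> entity data held by this processor
    received:   list of (source processor, [(key, entity data), ...]) messages
    Returns the keys that were actually inserted, in arrival order.
    """
    inserted = []
    for source, entities in received:
        for key, data in entities:
            if key in local_mesh:   # already unpacked from an earlier message
                continue
            local_mesh[key] = data
            inserted.append(key)
    return inserted

# Processors 0 and 2 both migrate to processor 1 and share entity (0, 17).
mesh_on_1 = {}
new_keys = unpack_submeshes(mesh_on_1, [
    (0, [((0, 17), "vertex"), ((0, 18), "edge")]),
    (2, [((0, 17), "vertex"), ((2, 5), "face")]),
])
```

A balanced tree gives a worst-case logarithmic bound per lookup, whereas the dict used here gives constant time only on average; the filtering logic is the same in both cases.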

Senders and Receivers to Owners: These steps operate only on
the sent and received migrated boundary entities. These
entities are tested to see if they are still used on
processor $p$. Determining the use on processor $p$ of a
$d$-dimensional entity requires determining if that entity
is part of the boundary of a $(d+1)$-dimensional entity on
processor $p$. The entity hierarchy data structures of the
SCOREC mesh database readily provide this $d$ to $d+1$
dimensional entity adjacency relationship. If the entity is
used, its use $(p, a)$ is packed and identified by the
$(p_o, a_o)$ use to be sent to the owner processor. If the
entity is not used, $(p, \mathrm{null})$ is packed. Once
packed, this information is sent to the owner processors.
The overall complexity of these steps is proportional to the
size of the sent and received migrated boundaries.

Owners to Affected Processors: Owners receive updates
targeting a particular entity $(p_o, a_o)$ they own. If a
use $(p, a)$ is received, it is inserted in the list of uses
of the shared entity at address $a_o$. If
$(p, \mathrm{null})$ is received, the use $(p, a)$ is
deleted from the list of uses at address $a_o$. Once all the
updates are completed, the ownership of these entities is
regenerated. The updated links are then packed and sent to
the affected processors. The affected processors receive
these uses and update the corresponding local shared
entities' lists of uses. At this point, the migrated
boundary entities can be deleted and mesh migration
completes.
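The owner-side update can be sketched as below. This is a hedged illustration rather than the PMDB code: the layout of `use_lists`, the update tuples and the function name are hypothetical, and the ownership rule (lowest processor id among the remaining uses) is an assumption made only so that the example is complete.

```python
def apply_use_updates(use_lists, updates):
    """Apply (processor, address-or-None) updates to the owner's use lists.

    use_lists: owner address a_o -> {processor: address on that processor}
    updates:   list of (a_o, p, a); a is None when p no longer uses the entity
    Returns {a_o: new owner processor} for every touched entity with uses left.
    """
    touched = set()
    for a_o, p, a in updates:
        uses = use_lists.setdefault(a_o, {})
        if a is None:
            uses.pop(p, None)   # (p, null): delete the use held by p
        else:
            uses[p] = a         # (p, a): insert or refresh the use
        touched.add(a_o)
    # Assumed ownership rule (not from the text): lowest processor id wins.
    return {a_o: min(use_lists[a_o]) for a_o in touched if use_lists[a_o]}

# Entity at owner address 100 is used by processors 0 and 2; it moves 0 -> 1.
uses = {100: {0: 100, 2: 40}}
new_owners = apply_use_updates(uses, [(100, 1, 7), (100, 0, None)])
```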

Computing Number of Receives: Steps 5-6, 10-11 and 14-15
implement non-blocking sends and receives. Each processor
needs to know how many messages are being sent to it by
other processors so that it can post a corresponding number
of receive statements. A simple way to compute the number of
receives is by first having each processor initialize a
vector $r$ of length $n_p$ and set $r_p$ to 1 if a message
will be sent to processor $p$ and 0 otherwise. A follow-up
sum scan operation can then be executed by all the
processors, resulting in each location $r_p$ containing the
number of receives. This procedure has $O(n_p \log n_p)$ run
time complexity and requires a message of length $n_p$ to be
communicated during the combine operation. Whereas this
scheme will be efficient for small $n_p$, it is nevertheless
non-scalable. The DIME environment, for example, makes use
of the crystal_router [24], which provides a scheme for this
problem by utilizing $\log n_p$ message exchanges across the
dimensions of hypercube multiprocessors.
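The simple vector-based count can be illustrated with a serial simulation. This is a sketch only: the processors are modelled as list indices, the element-wise sum stands in for the parallel combine operation, and the function name is hypothetical.

```python
def count_receives(send_targets, n_p):
    """send_targets[q] is the set of processors that q will send to."""
    # Each processor q builds its own 0/1 vector of length n_p.
    local = [[1 if p in send_targets[q] else 0 for p in range(n_p)]
             for q in range(n_p)]
    # Element-wise sum over all processors (done serially here; a parallel
    # code would use a combine/reduction across processors instead).
    return [sum(local[q][p] for q in range(n_p)) for p in range(n_p)]

# Processor 2 sends to 7, processors 3 and 7 send to 4, processor 6 sends to 0.
sends = [set(), set(), {7}, {4}, set(), set(), {0}, {4}]
receives = count_receives(sends, 8)
```

Every processor handles a vector of length $n_p$ regardless of how few messages it sends, which is the non-scalable part noted above.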

Considering the fact that each processor $p$ usually sends
to a small number $s_p$ of processors, a scalable strategy
is desirable for large $n_p$. We can make this scheme
scalable by making use of the radix sort routine [7]. Since
the processor ids are in the range $0, \ldots, n_p - 1$,
this problem can be solved by sorting
$T_s = \sum_{p=0}^{n_p-1} s_p$ keys, each of length
$\log n_p$ bits. Before applying the radix sort, the keys
are balanced by moving them such that each processor has
$T_s/n_p$ keys (assuming, without loss of generality, that
$T_s$ is divisible by $n_p$). The balancing can be done by
first scan summing $s_p$ to get the total number of messages
and assigning a rank to each send. The target processor and
the location on that processor can then be computed knowing
the rank of the send and the total number of sends. The
complexity of such a procedure will be
$O(\max_p\{s_p\} + (T_s/n_p)\log^2 n_p)$, which is
polylogarithmic in $n_p$ when $s_p$ is small.

Figure 7. Example showing steps of mesh migration

We illustrate this with an example when $s_p = 1$. Consider
the following scenario with the intended sends in the first
row of Table 1. Here x is assigned the value 8, which equals
$n_p$, and indicates the processors which will not send any
messages to anyone.


processor             0      1      2      3      4      5      6      7
send to processor     x      x      7      4      x      x      0      4
sort                  0      4      4      7      x      x      x      x
mark end              1      0      1      1
segment sum           1      1      2      1
number of recvs       1             2      1
ranges of no-recvs    [1,3]         [5,6]

Table 1. Example showing computation of number of receives


After sorting the sends, the duplicate sends are now
contiguous. We can mark the end of duplicate sends by
communicating and comparing sends with the right neighbor
processor. A segment-sum [7] can then be performed to count
the number of duplicates. The processor marked at the end of
the segment has the number of receives for the corresponding
target processor. Note, however, that even though we can now
inform the receiving processors of the number of messages,
we still need to make sure that each processor posting a
receive will be matched by a corresponding send by some
processor. This can be done by letting all processors post
receives which can be satisfied by guaranteed sends. As a
result, we need to send messages to non-receivers telling
them that they will be receiving no messages. The last row
computes the ranges of ids of non-receiving processors,
which can be computed again by a neighbor communication.
Given this contiguous range, the processors can be sent a
message by a parallel recursive bisection strategy in
logarithmic time. At the end, all processors will know how
many messages they will be receiving.
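The rows of Table 1 can be reproduced with a short serial simulation. It is a sketch under stated assumptions: each processor holds at most one send, Python's built-in sort stands in for the parallel radix sort, plain loops stand in for the neighbor communication and the segmented sum, and the function name is hypothetical.

```python
def receives_by_sorting(sends, n_p):
    """sends[q] is the target of processor q, or n_p if q sends nothing."""
    keys = sorted(sends)  # stands in for the parallel radix sort
    # Mark the last entry of every run of equal targets (right-neighbor compare).
    mark_end = [keys[i] < n_p and (i + 1 == n_p or keys[i + 1] != keys[i])
                for i in range(n_p)]
    # Segmented sum: a running count that restarts after each marked position.
    seg_sum, run = [], 0
    for i in range(n_p):
        run = run + 1 if keys[i] < n_p else 0
        seg_sum.append(run)
        if mark_end[i]:
            run = 0
    n_recv = {keys[i]: seg_sum[i] for i in range(n_p) if mark_end[i]}
    # Gaps between consecutive distinct targets are the non-receivers.
    bounds = [-1] + sorted(n_recv) + [n_p]
    no_recv = [(lo + 1, hi - 1)
               for lo, hi in zip(bounds, bounds[1:]) if hi - lo > 1]
    return seg_sum, n_recv, no_recv

# First row of Table 1, with x = 8 = n_p.
seg_sum, n_recv, no_recv = receives_by_sorting([8, 8, 7, 4, 8, 8, 0, 4], 8)
```

For the data of Table 1 this gives one receive for processor 0, two for processor 4 and one for processor 7, with non-receiver ranges [1,3] and [5,6].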

In the above scheme, the complexity is dominated by the
sorting scheme. A more elaborate scheme in reference [41]
provides radix sorting in $\log n_p$ time. We also remark
that currently PMDB uses the simple $O(n_p \log n_p)$
procedure since the largest number of processors used is 64,
a number too small to make the scalable version worthwhile.

2.2.5 Scalability of Mesh Migration and Extensions

In  the  mesh  migration  procedure  presented  above,  the 
amount  of  communication  involved  is  proportional  to  the 
volume  of  submeshes  in  the  first  stage  of  the  algorithm 
and  to  the  surface  of  submeshes  during  link  updates 
in  the  second  and  third  stages.  As  a  result,  if  each 
processor  migrates  to  a  small  number  of  processors,  such 
as  its  neighbors,  then  we  expect  that  the  migration  will 
scale  as  the  number  of  processors  is  increased.  Various 
tests  have  been  performed  to  demonstrate  scalability  of 
migration.  The  data  involving  the  maximum  number 
of  regions  migrated  by  a  processor,  the  total  number  of 
regions  migrated  by  all  processors,  the  time  taken,  and  the 
throughput,  that  is,  the  number  of  regions  sent  by  a  single 
processor  per  second  are  plotted  against  the  number  of 
processors  used. 

Test 1: In the first test, we let each processor exchange a
slice on its partition boundary with its neighbors. This
test is a realistic representative of the migration patterns
that occur in iterative dynamic load balancers, since
regions near partition boundaries are migrated in clusters
to the neighborhood of a heavily loaded processor. Another
application that performs this kind of migration is mesh
coarsening [10]. Figure 8 shows the example mesh that was
used before (a) and after migration (b). Figure 9(a) plots
the maximum number of regions sent by a processor and (b)
shows the wall time taken. From these plots, we see that
execution time is proportional to the number of regions sent
irrespective of the number of processors. Figure 10(a), on
the other hand, plots the total number of regions sent by
all processors. As the number of processors is increased,
the total number of regions at partition boundaries
increases. Hence, even though overall more regions have been
moved, the time is proportional to the maximum sent by a
single processor. This behavior demonstrates that when
processors migrate to a small number of neighbors, the
migration procedure scales well. Figure 10(b) plots the
throughput attained.


Figure 8. Neighborhood migration test: before boundary
exchange (a), after boundary exchange (b)

Figure 9. Neighborhood migration test for box mesh: maximum
number of regions migrated by a processor (a), wall time (b)

Test 2: In the second test, we let each processor hold 2500
regions corresponding to a partition of the box mesh and
migrate all its regions, randomly targeted to $s$ processors
with $s = 1, \ldots, 2^i, \ldots, n_p - 1$. The plots of the
time taken for migration and the throughput per processor
are shown in Figure 11. The plot in (a) shows that as the
number of processors is increased, the time taken grows
slowly. In particular, if we look at the $s = 1$ case, we
see a flat curve between 32 and 64 processors. The number of
processors has been doubled, yet the execution time remains
the same. As $s$ is increased, the execution time growth is
larger, as expected, since the total number of migrations is
increased. In particular, if $s = n_p - 1$, we have
all-to-all migration. Note that there is a pronounced drop
in the throughput, as shown in Figure 11(b), between the
cases $s = 1$ and $s = 2$. For example, with $n_p = 48$, the
throughput is 519 regions for $s = 1$ and drops to 309
regions at $s = 2$. The major cause of this drop is not the
mere increase in $s$, but rather the fact that when regions
are assigned random destinations, the union of the migrated
boundary of the mesh entities being sent becomes
proportional to the number of regions sent. In the case of
$s = 1$, the migrated boundary is proportional to the
surface of the mesh entities sent. As a result, since the
cost of stages 2 and 3 of the mesh migration algorithm is
dependent on the size of the migrated boundary, these stages
contribute greatly to the drop in cases $s > 1$. The sets of
regions which are migrated in practice are clustered locally
and hence the migrated boundary size is rarely proportional
to the volume being sent. Therefore, higher throughput rates
can be attained for larger $s$, as is evident from Test 1
above.

Figure 10. Neighborhood migration test for box mesh: total
number of regions migrated by all processors (a), throughput
per processor (b)

This section discussed the data structures and the migration
routines used in the PMDB library. The PMDB library
currently supports triangular and tetrahedral meshes.
However, the data structures and the mesh migration
procedures easily extend to other types of elements such as
quads, bricks or mixed meshes. Further fine tuning is also
possible, which can reduce memory requirements and improve
the throughput of the migration procedure by, for example,
generating ownership corresponding to the target processor
for entities on the migrated model boundary.

2.3. Dynamic Load Balancing of Adaptively Evolving Meshes

The evolving nature of an adaptive discretization introduces
load imbalance into the solution process. Therefore, it is
critical that the load be dynamically rebalanced as the
adaptive calculation proceeds. The current repertoire of
partitioning and dynamic redistribution heuristics for
unstructured meshes can be classified into three main
categories, given as follows:

The most popular category involves Recursive Bisection (RB)
techniques, which repeatedly split the mesh into two
submeshes. Coordinate RB methods bisect the elements by
their spatial coordinates. If the axis of bisection is
Cartesian, then it is called Orthogonal RB [4]. If the axis
is chosen to be along the principal axis of the moment of
inertia matrix, then it is called Inertial RB. Spectral RB
is another method, which utilizes the properties of the
Laplacian matrix [22] of the mesh connectivity graph and
bisects it according to the eigenvector corresponding to the
second smallest eigenvalue of this matrix [56].

Figure 11. Migrating to s processors: wall time in seconds
(a), throughput per processor (b)

The least popular choice for partitioning meshes is the
class of probabilistic methods, which includes simulated
annealing and genetic algorithms. These methods require many
iterations and are expensive to use as mesh partitioning
methods [91].

Iterative Local Migration techniques have been the target of
recent attention due to their potential for dynamically
balancing adaptive meshes which change incrementally. These
techniques exchange load between neighboring processors to
improve the load balance and/or decrease the communication
volume. The processor neighborhood can be defined either by
the hardware links or by the connectivity of the split
domains. The cyclic pairwise exchange [33] pairs processors
connected by a hardware link and exchanges the nodes of the
mesh to improve the communication. Leiss/Reddy [43], on the
other hand, use the hardware link as the neighborhood to
transfer work from heavily loaded to less loaded processors.
The Tiling system [18] uses and extends the Leiss/Reddy
algorithm to the case where the neighborhood is defined by
the connectivity of the split domains. The algorithm of
Löhner et al. [50] exchanges elements between subdomains
according to a deficit difference function which reflects
the imbalance between an element and its neighbors. The
procedure by Vidwans et al. [85] uses a divide and conquer
approach to pair processors and uses connectivity as well as
coordinate information to decide which elements to migrate.

A disadvantage of common implementations of RB methods is
that they start with the entire mesh on a single processor
and partition from there. Two problems with this approach in
a parallel adaptive calculation are (i) the time required to
gather the distributed mesh together on a single processor,
and (ii) the fact that after the mesh has been adapted, it
may have grown to the point that it cannot fit on a single
processor. These problems can be alleviated if the mesh
remains distributed during the repartitioning process. The
next subsection discusses a parallel implementation of
Inertial Recursive Bisection that operates on a distributed
mesh.

RB methods operate on the whole mesh and compute the direct
destination for each element. Because of this, it is
possible that RB methods may require complete remapping of
the elements at the end. On the other hand, iterative local
migration techniques propagate the excess load by local
transfers to other processors. A disadvantage of iterative
local migration techniques is that many iterations may be
required to regain global balance and hence elements reach
their final destination after many local transfers rather
than directly. In particular, when elements are migrated,
the full element data involving connectivity and local
attached data are communicated. For parallel repartitioners
based on coordinate bisection, only the centroids and region
pointers need to be communicated during a parallel sorting
phase. As a result, this class of repartitioners may have
better performance on machines in which the communication
between any pair of processors is distance-independent.

Subsection 2.3.2 presents an iterative load balancing
procedure based on the Leiss/Reddy heuristic of requesting
load from the most heavily loaded neighbor. The performance
of this procedure is compared with repartitioning by the
parallel distributed inertial recursive bisection algorithm.

2.3.1 Geometry-Based Dynamic Balancing Procedures

Geometry-based dynamic balancing (or repartitioning) relies
here on the Inertial Recursive Bisection (IRB) method [50],
which is a variation of the more classic Orthogonal
Recursive Bisection (ORB) [4]. ORB is a recursive process
that bisects a set of entities by considering the median of
the set of corresponding centroids with respect to a given
coordinate axis. As ORB is recursively called, the choice of
coordinate axis is circularly permuted (x, y, z, x, etc.).
Unlike ORB, IRB considers the inertial coordinate system
(the origin is at the center of gravity and the three axes
are the principal axes of inertia) for the set of entities
to be bisected. In three dimensions, the determination of
the three principal axes of inertia is an eigenvalue problem
of order 3. Once the inertial coordinate system is defined,
the coordinates of the centroids are transformed and the cut
is made at the median with respect to the first coordinate.
This first coordinate is the “key” that the sorting
algorithm described later in this section works on.
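A single inertial bisection can be sketched in two dimensions, where the principal axis has a closed form and no eigenvalue solver is needed. The sketch rests on assumptions not stated in the text: all centroids carry unit mass, and the "first" inertial coordinate is taken along the direction of greatest spread of the centroids. The function name is hypothetical.

```python
import math

def inertial_bisect_2d(centroids):
    """Split centroid indices at the median of the first inertial coordinate.

    centroids: list of (x, y) pairs, unit mass each (an assumption)
    Returns (low, high): index lists on either side of the median cut.
    """
    n = len(centroids)
    cx = sum(x for x, _ in centroids) / n   # center of gravity
    cy = sum(y for _, y in centroids) / n
    sxx = sum((x - cx) ** 2 for x, _ in centroids)
    syy = sum((y - cy) ** 2 for _, y in centroids)
    sxy = sum((x - cx) * (y - cy) for x, y in centroids)
    # Direction of greatest spread of the 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ax, ay = math.cos(theta), math.sin(theta)
    # The "key" of each entity is its coordinate along that axis.
    keys = [(x - cx) * ax + (y - cy) * ay for x, y in centroids]
    order = sorted(range(n), key=lambda i: keys[i])
    half = n // 2
    return order[:half], order[half:]

# Centroids strung along the diagonal y = x: the cut is normal to the diagonal.
low, high = inertial_bisect_2d([(float(i), float(i)) for i in range(8)])
```

An ORB step would differ only in skipping the axis computation and using one of the original coordinates as the key.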

The main assumption for performing repartitioning in
parallel is that the entities are distributed. It is also
assumed that there is no reason for the number of entities
stored on each processor to be uniform across processors.
The result of this repartitioning will be an equal number of
entities per processor. It should be noted that, in this
context, the goal of repartitioning is equivalent to the
goal of dynamic load balancing [15, 55, 73, 54, 43, 85]. The
key algorithm in IRB (and ORB) is the determination of the
median for a given set of doubles (referred to as “keys”)
[68]. With respect to this paper, the “keys” are the first
coordinates, in the inertial frame, of the entities to be
bisected. The method used here is to sort the “keys” and
then pick the entry at the middle of the sorted list. In
this case, efficiently performing IRB in parallel can be
reduced to efficiently sorting in parallel [34]. From the
conclusions of the paper by Blelloch et al. [8], which
compares different parallel sorting algorithms (Batcher’s
bitonic sort, radix sort, and sample sort), it appears that
the sample sort algorithm is the fastest of the three for
large data sets. Therefore, a parallel sample sort algorithm
has been implemented in order to efficiently support IRB.
Given a set of n “keys” distributed on p processors
(n ≫ p), a sample sort algorithm consists of three main
steps:

1. p-1 splitters (or pivots) are chosen among the n “keys”

2. Each “key” is routed to the processor corresponding to
   the bucket the “key” is in

3. “Keys” are sorted within each bucket (no communication)

The goal of step 1 is to split the set of “keys” into p
parts (buckets) as evenly as possible and as efficiently as
possible. The p-1 splitters, which are implicitly sorted
(say with respect to increasing value), are labeled from 1
to p-1. All distributed “keys” below splitter 1 belong to
bucket 0, all distributed “keys” between splitter i
(0 < i < p-1) and splitter i+1 belong to bucket i, and all
distributed “keys” above splitter p-1 belong to bucket p-1.
Processor i (0 ≤ i < p) is responsible for the bucket
labeled i. In step 2, assuming the p-1 splitters have been
found and broadcast to all processors, the bucket of any
distributed “key” can be determined and the “key” is
rerouted to the processor that is responsible for that
bucket. At this point, every processor has knowledge of all
“keys” that belong to the bucket it has been assigned. Step
3 can be performed using any efficient sequential sorting
algorithm, like quicksort [68]. It is clear that the
parallel efficiency of the sample sort algorithm depends on
the sizes of the buckets. Parallel efficiency is maximal
when the sizes of the buckets are nearly equal. A sampling
method is used to obtain “good” splitters. Given the n input
“keys”, ps “keys” (s is an integer ≥ 1 called the
oversampling ratio) are selected at random and sorted,
typically sequentially. The entries in the sorted list of
ranks s, 2s, ..., (p-1)s are the p-1 splitters. The bound
for bucket expansion (ratio of maximum bucket size to
average) is given in the paper by Blelloch et al. [8]. In
practice, the oversampling ratio should be such that the
sorting to find the splitters (which is done serially) does
not become a bottleneck for the global parallel sample sort
algorithm. For the purpose of the presented repartitioning
technique, the oversampling ratio is chosen such that ps is
of the order of n/p (n/p being of the order of the number of
“keys” to sort in step 3).
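The three steps of sample sort can be illustrated with a serial simulation in which each "processor" is a Python list. This is a sketch, not the implementation used for IRB: the sampling is done with replacement from the pooled keys, the routing is a plain loop rather than message passing, and the function name and `seed` parameter are hypothetical.

```python
import bisect
import random

def sample_sort(keys_per_proc, s, seed=0):
    """Simulate a sample sort of keys spread over p processors.

    keys_per_proc: list of p lists of floats (keys held by each processor)
    s:             oversampling ratio (integer >= 1)
    Returns p sorted buckets; bucket i holds the keys owned by processor i.
    """
    p = len(keys_per_proc)
    all_keys = [k for local in keys_per_proc for k in local]
    rng = random.Random(seed)
    # Step 1: draw p*s random samples, sort them, and keep the entries of
    # rank s, 2s, ..., (p-1)s (1-based ranks, hence the "- 1").
    sample = sorted(rng.choice(all_keys) for _ in range(p * s))
    splitters = [sample[i * s - 1] for i in range(1, p)]
    # Step 2: route each key to the bucket given by its position among
    # the splitters.
    buckets = [[] for _ in range(p)]
    for local in keys_per_proc:
        for k in local:
            buckets[bisect.bisect_right(splitters, k)].append(k)
    # Step 3: sort inside each bucket (no communication needed).
    return [sorted(b) for b in buckets]

gen = random.Random(1)
data = [[gen.random() for _ in range(50)] for _ in range(4)]
buckets = sample_sort(data, s=8)
```

Concatenating the buckets in processor order gives the globally sorted list; the bucket sizes, and hence the load of step 3, depend on how well the splitters divide the keys.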

The following pseudo-code shows the process of
repartitioning using IRB in parallel. It is assumed that the
entities are already distributed on processors. A statement
of the form for ( i = 0 ; i < n ; i++ ) { ... } indicates a
loop that is executed as long as the loop variable i,
initially set to 0 (i = 0) and incremented by 1 upon
completion of each pass (i++), has a value less than n
(i < n) [40]. Each processor executes the following
pseudo-code (MIMD):

1. Associate each entity with a “key” structure consisting
   of:

   a. 3 doubles for the coordinates of the entity’s centroid
      with respect to the current inertial coordinate system
      (initially with respect to the original coordinate
      system)

   b. 1 integer that indicates on which processor the actual
      entity is stored

   c. 1 pointer to the entity

   d. 1 integer that indicates the destination processor for
      the entity

2. for ( step = 0 ; step < log2 p ; step++ ) {

   a. Split the p processors into 2^step processor sets
      (each set is of cardinality p’ = p / 2^step)

   b. Balance the load such that each processor has
      approximately the same number of keys (reroute the
      keys accordingly)

   c. Get center of gravity, find the three principal axes
      of inertia, and apply transformation to the keys

   d. Get p’ - 1 splitters among the keys

   e. Depending on the position with respect to the
      splitters, determine in which bucket (processor) each
      key goes (reroute the keys accordingly)

   f. Sort the keys (no communication)

   g. Depending on the position with respect to the median,
      determine in which bucket (processor) each key goes
      (reroute the keys accordingly)

   h. Free the processor sets

   }

3. The destination processor is set to the processor the key
   is currently in

4. Reroute all keys to the originating processors

5. Migrate entities according to the destination processor
   stored at the key level

Steps 2.b through 2.g are done independently on each
processor set. Once all keys have been sorted in the
processor set (at the end of step 2.f), the median (the key
that splits the set of keys into two subsets of the same
cardinality) is easily obtained. Any key that is before the
median is placed (if not already there) on a processor with
a low rank (0 to p’/2 - 1) and any key that is after the
median is placed (if not already there) on a processor with
a high rank (p’/2 to p’ - 1). This guarantees that any key
stored on a processor set is smaller than any key in the
next processor set. Figure 12 is a graphical depiction of
steps 2.b through 2.g in the case when p’ equals two. At
each step, the array of keys (distributed across the
processors in the set) is represented by a horizontal line
which is cut to show how it is currently distributed. The
symbol < indicates that the keys in the array are now
sorted; if above the processor cutter, it also indicates
that any key in the left processor’s array is smaller than
any key in the right processor’s array. If there is no such
symbol, the keys are not sorted yet.

[Figure 12 panels, top to bottom: Initial state; Balance,
Transform, Get splitters; Put in buckets (splitters); Sort;
Put in buckets (median)]

Figure 12. Graphical description of the repartitioning
algorithm (2-processor set)
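The way the processor sets are halved until every key has a destination processor can be sketched serially as follows. This is a simplified illustration under stated assumptions: the keys are fixed scalars, so the per-set inertial transformation of step 2.c is left out; Python's built-in sort stands in for the parallel sample sort; the number of processors is a power of two; and the function name is hypothetical.

```python
def assign_destinations(keys, procs):
    """Recursively bisect keys at the median and assign processor ranks.

    keys:  list of (value, entity id) pairs; value plays the role of the key
    procs: list of processor ranks in the current set (power-of-two length)
    Returns {entity id: destination processor}.
    """
    if len(procs) == 1:
        return {entity: procs[0] for _, entity in keys}
    ordered = sorted(keys)      # stands in for the parallel sample sort
    half = len(ordered) // 2    # position of the median
    mid = len(procs) // 2
    # Keys before the median go to the low ranks, the rest to the high ranks.
    dest = assign_destinations(ordered[:half], procs[:mid])
    dest.update(assign_destinations(ordered[half:], procs[mid:]))
    return dest

values = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
dest = assign_destinations([(v, i) for i, v in enumerate(values)], [0, 1, 2, 3])
```

Each level of the recursion corresponds to one pass of the loop in step 2, and the dictionary returned at the end plays the role of the destination field set in step 3.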

Figure 13 shows a randomly distributed mesh (approximately
35,000 elements) and the resulting dynamically repartitioned
mesh for eight processors. Figure 14 shows timings
(wall-clock seconds on an IBM SP-2) for that particular mesh
on 2, 4, 8, and 16 processors. The processor assignment
timing corresponds to steps 1 to 4 (decision making). The
migration timing corresponds to step 5. It should be noted
that a randomized mesh as the initial state is a worst-case
scenario for the migration part of the repartitioning
procedure. Past four processors, the time spent decreases as
the number of processors increases, which is a good
indication of scalability. It is conjectured that the
“abnormal” speed with two processors is due to the fact that
(i) the only processor set ever used is the full set of
processors and (ii) there is some performance degradation
when more than one processor set is defined.

2.3.2  Topologically-Based  Dynamic 
Balancing  Procedures 

Tree Based Load Balancing Algorithm The Tiling
system, which uses the Leiss/Reddy approach, calculates
the load averages utilizing the immediate neighborhood.
To incorporate more global information and to direct load
transfers, we view the processor requests for load from
heavily loaded processors as forming a forest of trees.
Figure 15(a) shows an example of requests that can be
formed. Given this hierarchical arrangement of processors
as the nodes of trees, we balance the trees as shown
in Figure 15(b) and iteratively repeat the process until the
load distribution converges to optimal load balance within
a user-supplied tolerance. The full algorithm is given in
Figure 16. The procedure details are given as follows.

Figure 13. Dynamic repartitioning
on a randomly distributed mesh

Figure 14. Timings for dynamic repartitioning

Figure 15. Load balancing example: (a) load
requests, (b) load migration on the tree


procedure tree_load_balance(tol_load, maxiter)

in tol_load : imbalance load tolerance
in maxiter  : maximum number of iterations

begin
1  iter = 0
2  while (max. load difference > tol_load) and
         (iter < maxiter) do
3     iter = iter + 1
4     Compute neighboring load differences.
5     Request load from neighbor processor having
      largest load difference (creates processor trees).
6     Linearize processor trees.
7     Compute amounts of load migration.
8     Select and migrate load.
9  endwhile
end

Figure  16.  Tree  based  dynamic  load  balancing  procedure 

Steps of the procedure The steps of balancing the forest
of trees are repeated until convergence is achieved.
Assuming that load transfer occurs when there is a load
difference of at least two units, Leiss/Reddy’s algorithm
has a worst-case imbalance of $d/2$, where $d$ is the diameter
of the network. In such a converged state, all the processors
have a load difference of one with their neighbors.
This configuration forms a staircase load distribution. For
our tree_load_balance, a staircase will not be the
worst-case distribution, but rather a forest of trees each of
which has a maximum height of two and load difference
of one. In such a state, the worst-case imbalance will
be $d/4$. This kind of imbalance can be tolerated on a
coarse-grain machine. For example, a 100K mesh on 64
processors will imply a worst case of 1.02% imbalance.
In step 4, load differences are computed by having each
processor send its load value to its neighbors and correspondingly
receive load values from its neighbors.
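
The 1.02% figure quoted for the 100K mesh can be checked with a short calculation. The value $d = 64$ used here is an assumption of this note (roughly the diameter of a linear chain of 64 processors, the largest a 64-processor network could have); the text does not state which diameter was used.

```latex
\begin{align*}
\text{average load} &= \frac{100\,000}{64} = 1562.5 \ \text{elements per processor},\\
\text{worst-case imbalance} &= \frac{d}{4} = \frac{64}{4} = 16 \ \text{elements},\\
\text{relative imbalance} &= \frac{16}{1562.5} \approx 1.02\%.
\end{align*}
```

Under the same assumption, the $d/2$ bound of the staircase distribution would correspond to 32 elements, or about 2.05%.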

Step  5  invokes  the  Leiss/Reddy  load  request  process. 
Since  each  processor  can  receive  requests  from  multiple 
processors,  but  can  only  request  from  a  single  processor, 
a  forest  of  trees  is  formed. 
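
The way single requests produce a forest can be illustrated with a minimal serial sketch. The function name, the dictionary layout, and the tie-breaking rule (lowest-numbered neighbor) are assumptions made for illustration and are not taken from the Leiss/Reddy procedure itself.

```python
def build_request_forest(loads, neighbors):
    """Return parent[p]: the processor p requests load from, or None.

    loads:     dict processor -> current load
    neighbors: dict processor -> list of adjacent processors
    Assumed tie rule: the lowest-numbered neighbor wins.
    """
    parent = {}
    for p, load in loads.items():
        best, best_diff = None, 0
        for q in sorted(neighbors[p]):
            diff = loads[q] - load
            if diff > best_diff:          # q is more heavily loaded than p
                best, best_diff = q, diff
        parent[p] = best                  # None marks the root of a tree
    return parent


# Four processors on a chain 0 - 1 - 2 - 3.
loads = {0: 2, 1: 10, 2: 4, 3: 6}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
parent = build_request_forest(loads, neighbors)
```

Because a processor only requests from a strictly more loaded neighbor, following parent links always increases the load, so no cycle can form and the result is a forest. In this example processors 0 and 2 both request from processor 1, while processor 3 is the root of a single-node tree.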

In step 6, the trees are linearized for efficient scan operations.
One possible linearization is given by the Euler
Tour [34]. This, however, requires $2(|T| - 1)$ links, where
$|T|$ denotes the number of vertices in a tree. We use
depth-first links [41][83], which use between $|T|$ and
$2(|T| - 1)$ links.
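
The link count of the Euler tour can be seen in a small sketch. It illustrates only the Euler tour linearization, not the depth-first links actually used in the procedure; the function name and the child-list representation are assumptions.

```python
def euler_tour_links(children, root):
    """List the directed links (u, v) of an Euler tour of a rooted tree."""
    links = []

    def visit(u):
        for v in children.get(u, []):
            links.append((u, v))   # go down the tree edge
            visit(v)
            links.append((v, u))   # come back up the same edge

    visit(root)
    return links


children = {1: [0, 2], 2: [3]}     # a tree with |T| = 4 vertices
links = euler_tour_links(children, 1)
```

Each of the $|T| - 1$ tree edges is traversed once in each direction, which gives the $2(|T| - 1)$ links mentioned in the text (six for this four-vertex tree).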

Step 7 computes the amounts of load migration on the
tree using logarithmic scan operations on the linearized
tree. Let $load\_mig_i$ denote the amount of load that will be
migrated into or out of a tree node $i$, which represents a
processor. Let $T_i$ denote the subtree with node $i$ as
its root, and let $load(T_i)$ be the sum of the loads
of the nodes in this subtree. The amount of load migration
is then calculated as

$$load\_mig_i = load(T_i) - avg\_load(T) \cdot |T_i|$$

with $avg\_load(T) = load(T)/|T|$ representing the average
load on the tree when balanced. Given $load\_mig_i$,
the direction of load migration can be found as

$$load\_mig_i \begin{cases} = 0, & \text{do nothing with parent,} \\ < 0, & \text{get load from parent,} \\ > 0, & \text{send load to parent.} \end{cases}$$

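The migration amounts of step 7 can be illustrated on a single tree. This is a serial stand-in, not the logarithmic scan on the linearized tree described in the text; the function name and data layout are assumptions.

```python
def load_migration(parent, loads):
    """Return load_mig[i] = load(T_i) - avg_load(T) * |T_i| for one tree.

    parent: dict node -> parent node (None for the root)
    loads:  dict node -> load on that node
    """
    subtree_load = dict(loads)
    subtree_size = {i: 1 for i in loads}

    def depth(i):
        d = 0
        while parent[i] is not None:
            i = parent[i]
            d += 1
        return d

    # Visit the deepest nodes first so children are folded into parents.
    for i in sorted(loads, key=depth, reverse=True):
        p = parent[i]
        if p is not None:
            subtree_load[p] += subtree_load[i]
            subtree_size[p] += subtree_size[i]

    root = next(i for i in loads if parent[i] is None)
    avg = subtree_load[root] / subtree_size[root]
    return {i: subtree_load[i] - avg * subtree_size[i] for i in loads}


parent = {1: None, 0: 1, 2: 1, 3: 2}
loads = {0: 2, 1: 10, 2: 4, 3: 8}
mig = load_migration(parent, loads)
```

Here the average load is 6. Node 0 has a negative value (it gets 4 units from its parent), node 3 has a positive value (it sends 2 units to its parent), and node 2 exchanges nothing with its parent because its subtree already holds exactly 12 units for 2 nodes.
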
Having calculated the directions of load migration, step
8 migrates the elements on the partition boundary in
a slice-by-slice manner until $load\_mig_i$ of them have
been transferred. Each slice of elements forms a peeling
of the partition boundary and is selected by choosing
elements which touch the boundary at any one of their
vertices. Figure 17 shows the element selection criterion
for migration.
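
The slice-by-slice peeling can be sketched as follows. The data layout, the function name, and the way the boundary is advanced (by adding the vertices of already selected elements) are assumptions made for illustration; the source only states that a slice consists of the elements touching the boundary by any vertex.

```python
def select_slices(elements, boundary_vertices, owned, amount):
    """Pick up to `amount` owned elements, peeling from the boundary.

    elements:          dict element id -> tuple of vertex ids
    boundary_vertices: set of vertex ids on the partition boundary
    owned:             set of element ids held by this processor
    """
    remaining = set(owned)
    front = set(boundary_vertices)
    selected = []
    while len(selected) < amount and remaining:
        # One slice: every remaining element touching the current front.
        slice_ = sorted(e for e in remaining
                        if any(v in front for v in elements[e]))
        if not slice_:
            break
        for e in slice_:
            if len(selected) == amount:
                break
            selected.append(e)
            remaining.discard(e)
            front.update(elements[e])   # the boundary moves inward
    return selected


# A strip of five triangles; the partition boundary is at vertex 0.
elems = {0: (0, 1, 2), 1: (1, 2, 3), 2: (2, 3, 4), 3: (3, 4, 5), 4: (4, 5, 6)}
picked = select_slices(elems, {0}, set(elems), 3)
```

The first slice contains only element 0; the second slice contains elements 1 and 2, which touch the vertices exposed by the first slice.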

Examples In this section we plot various statistics that
show the performance of the load balancing on refined
meshes and compare them with coordinate-based repartitioning.

Test 1: In this test a patella mesh is refined manually
in the center of the model. At the beginning, the mesh
has 8497 tetrahedrons. After refinement, the number of
tetrahedrons increases to 15329. In Figure 18, we plot
the convergence history of tree_load_balance
for the 16-processor run. The scheme converges in 3
iterations.

Figure 17. Selecting elements for migration

Figure 18. The convergence history for load balance

In Figure 19, we plot the times taken for
tree_load_balance, the moment of inertia parallel
recursive bisection routine, and the parallel inertia
repartitioner. The parallel recursive bisection starts with
the whole mesh on one processor and recursively splits
it in parallel. The repartitioner employs a parallel sort
routine. The parallel repartitioner outperforms the other
two strategies. There are various reasons why the inertia
repartitioner outperforms the load balancer. Firstly, the
performance of the load balancer is greatly affected by
the distances between the heavily loaded and underloaded
processors. For example, suppose there is imbalance
due to refinement at the corner of a model on one of
the processors. In such a case, to propagate the excess
load to the rest of the processors by local neighborhood
transfers, the number of steps needed
will be at least the diameter of the partition graph.
This is because the load will be transferred one step at a
time under the iterative load balancer. For other types
of distributions in which the distance between the lightly
loaded and heavily loaded processors is small and there is a
high frequency of load imbalances, the load balancer will
have better performance. The repartitioner bypasses the
effects of distance by directly sending load from heavily
loaded to lightly loaded processors. On an architecture
such as the IBM SP-2, in which communication cost
is independent of the distance between the processors
and hence the same between any pair of processors, the
repartitioner is advantageous since it sends
the load directly to its final destination. The load balancer
is at a disadvantage since it incurs an expensive latency
cost during the many local transfers it performs.


Figure  19.  Time  taken  for  load  balance, 
parallel  repartitioning  and  bisection 


Finally, Figure 20 shows the quality of the partitions
produced in terms of maximum and total percentage of
faces cut. The load balancer’s element selection criterion
for migration dictates the quality of the partitions. The
criterion currently used can be improved by incorporating
coordinate information into the selection decision.


Figure  20.  The  total  and  maximum 
number  of  cut  faces  for  each  method 


Test 2: In this test, various statistics are reported for the
adaptively refined ONERA M6 wing mesh during an actual
CFD analysis on 32 processors. At the beginning
the mesh has 85567 tetrahedrons. Three stages of adaptive
refinement are performed, during which the number
of tetrahedrons increases to 131000, 223501 and finally
to 388837. Figure 21 shows the convergence history of
the iterative load balancer. In all cases of load balancing
after refinement, the imbalance reduces to less than
4% during the first 8 iterations, and many more
iterations are needed to reduce this imbalance further to 0%.
One need not run tree_load_balance to
full convergence. It can be stopped when a reasonable
imbalance is achieved.


Figure  21.  The  convergence  history  for  load 
balance  during  three  stages  of  refinement 


Table 2 shows the execution time comparisons between
tree_load_balance and the parallel moment of
inertia partitioner. In all cases, the moment of inertia partitioner
outperforms tree_load_balance for the same reasons
explained in Test 1.


refinement                     1st             2nd             3rd
percent imbalance          1.7      0      3.8      0      1.9      0
tree_balance (sec)          73     85      127    210      189    283
inertia partition (sec)         21              48             128

Table 2 Execution times (in seconds) for
tree_load_balance and inertia partitioner


Finally, Table 3 shows partition quality comparisons between
tree_load_balance and the moment of
inertia partitioner. The percentage of maximum number
and the total number of cut faces are given for both
tree_load_balance and the inertia partitioner.


refinement                1st             2nd             3rd
percent cuts           max  total      max  total      max  total
tree_balance            24     10       26     11       25     10
inertia partition       26      7       25      7       19      6

Table 3 Maximum and total percentage of cut faces for
tree_load_balance and inertia partitioner

3. Parallel Automatic Mesh Generation

3.1. Introduction

The development of automatic mesh generation techniques
for complex three-dimensional configurations has
been an active area of research for over a decade [26, 78].
The introduction of these mesh generation procedures has
removed a major bottleneck in the application of finite
element and finite volume analysis techniques. The introduction
of scalable parallel computers is allowing the
solution of ever larger models. It is now common to see
meshes of several million elements solved on these computers,
with the ability to solve on meshes of hundreds of
millions of elements coming in the near future. As mesh
sizes become this large, the process of mesh generation
on a serial computer becomes problematic both in terms
of time and storage. Therefore, parallel mesh generation
procedures that operate on the same computer, and use
similar structures, as the parallel analysis procedures
must be developed.

With  recent  advances  in  the  efficiency  of  automatic  mesh 
generators  which  create  well  over  two  million  elements 
per  hour  on  a  workstation  [88],  one  may  question  the 
need  for  the  parallel  generation  of  meshes.  The  obvious 
answer  is  that  as  the  problem  size  grows,  the  solution 
process  on  parallel  computers  will  continue  to  scale  by  the 
addition  of  more  processors.  However,  mesh  generation 
on  a  single  processor  will  not  scale,  therefore  becoming 
the  computational  bottleneck.  A  second  critical  reason  for 
parallel  mesh  generation  is  the  shortage  of  memory  on  a 
sequential  machine  when  dealing  with  very  large  meshes. 
On  a  parallel  machine,  the  memory  problem  is  addressed 
by  distributing  the  mesh  over  a  number  of  processors, 
each of which stores its own portion of the mesh.

Efficient parallel algorithms require a balance of work
load among the processors while maintaining interprocessor
communication at a minimum. Key to determining
and distributing the work load and controlling communications
is knowledge of the structure of the calculations
and communications. In the finite element analysis
process, the mesh and its connectivity naturally provide
the required structure. The ability to maintain efficiency
is compromised when the structure and, therefore, the work
load and communications are altered, as is the case in parallel
adaptive finite element analysis [15, 55, 73, 54].
Parallel mesh generation is even more complex to control
effectively since the only structure known at the start
of the process is that of the geometric model, which has
no discernible relationship to the work load needed to
generate the mesh. On the other hand, the more useful
structure to discern work load and control communications
is the mesh, which is only fully known at the end
of the process. The lack of initial structure and of the ability to
accurately predict work load during the meshing process
underlies the selection of algorithmic procedures in the
parallel mesh generation procedure presented here. In
particular, the procedure employs an octree decomposition
of the domain to control the meshing process. The
octree structure supports the distribution or redistribution
of computational effort to processors.

3.2.  Background  and  Meshing  Approach 

To date, there has been limited attention given to parallel
automatic mesh generation algorithms. Löhner et al. [49]
have parallelized a two-dimensional advancing front procedure
which starts from a pre-triangulated model boundary.
The approach taken is to subdivide (partition) the
domain (with the help of a background grid) and distribute
the sub-domains to different processors for triangulation.
The interiors of the subdomains are meshed independently.
Then, the inter-subdomain regions are meshed
using a coloring technique to avoid conflicts. Finally, the
“corners” between more than two processors are meshed
following the same basic strategy. A “one master-many
slaves” paradigm has been chosen to drive the parallel
procedures. This approach has been extended to three
dimensions with some modifications [79]. A load balancing
phase follows the initial domain splitting (at the
background grid level). The interface gridding incorporates
mechanisms (i) to avoid degradation of performance
by using fine grain parallelism and (ii) to reduce the number
of processors when there is too much communication
overhead. Results show scalability of the method.

Saxena and Perucchio [64] describe a parallel Recursive
Spatial Decomposition (RSD) scheme which discretizes
the model into a set of octree cells. Interior and boundary
cells are meshed in parallel using either templates or element
extraction (removal) schemes. The algorithmic
procedure they employ to create these octant-level
meshes requires no communication between octants. The
main difficulty for this meshing approach is to guarantee
that a boundary octant can always be meshed regardless
of the complexity of the model. Robust loop building algorithms
which include possible tree refinement to resolve
invalid configurations are in general difficult to parallelize
[76]. Parallel results have been simulated on a sequential
machine.

The  parallel  mesh  generator  presented  here  builds  upon 
previous  work  on  sequential  octree-based  mesh  generators 
[66,  76,  77],  parallel  adaptive  finite  element  analysis 
procedures  [15,  55,  73],  and  parallel  mesh  generation 
[16].  It  meshes  three-dimensional  non-manifold  objects 
following  the  hierarchy  of  topological  entities.  That  is, 
the  model  edges  are  meshed  first,  the  model  faces  are 
meshed  second,  and  the  model  regions  are  meshed  last. 

The  current  discussion  focuses  on  the  octree-based  region 
meshing  procedure. 

Figure 22 graphically depicts the basics of the present
mesh generator. The first step in meshing a model region
is to develop a variable level octree which reflects
the mesh control information and is consistent with the
triangulation on the boundary of the model region. Octants
containing mesh entities classified on the boundary
of the model region to be meshed are constructed to be
approximately of the same size as the mesh entities they
contain. A one level difference on octants sharing one
or more edges is enforced during this process to control
smoothness of the mesh gradations. Once the octree
is generated, the octants are classified as interior, outside,
or boundary. Those classified as outside receive
no further consideration. Some interior octants are reclassified
boundary if they are too close to mesh entities
classified on the boundary of the model region (boundary-interior).
The purpose of this reclassification is to avoid
the complexities caused when interior octant mesh entities
(coming from the application of templates) are too
close to the boundary and may lead to the creation of
poorly shaped elements in that neighborhood. Interior
octants are meshed using templates. Face removal procedures
are then used to connect the boundary triangulation
to the interior octants. Figure 23 graphically describes a
face removal in a two-dimensional setting.


Figure  22.  Graphical  depiction  of  the 
basics  of  the  presented  mesh  generator 


Figure 23. Face removal (2-D setting)

3.3. Sequential Region Meshing

As indicated above, the starting point for the region
meshing process is a completely triangulated surface.
The surface triangulation must satisfy the conditions of
topological compatibility and geometric similarity [67]
with respect to the model faces. The region meshing
process consists of three steps: (i) generation of
the underlying octree, (ii) template meshing of interior
octants, and (iii) face removal to connect the given surface
triangulation to the interior octants.

Mesh faces to which tetrahedral elements will eventually
be connected are referred to as partially connected faces.
They are basically missing one connected tetrahedron
in the manifold case, and one or two in non-manifold
situations. Initially, the mesh faces classified on the
model boundary are the partially connected mesh faces.
Once templates have been applied, that is, at the start of
face removal, the interior mesh faces connected to exactly
one tetrahedron are also partially connected mesh faces.
In the remainder of this discussion, the current set of
partially connected mesh faces will be referred to as the
front. During face removal, tetrahedra are connected to
these faces, therefore eliminating them. Any non-existing
face of a newly created tetrahedron, referred to as a new
face, is a partially connected face until it is eliminated.
The face removal process is complete when there are no
partially connected mesh faces remaining.

3.3.1 Underlying Octree

The octree is built over the given surface mesh to (i) help
in localizing the mesh entities of interest, and (ii) provide
support for the use of fast octant meshing templates.
Proper localization is achieved by having each terminal
octant reference any partially connected mesh face which
is either totally or partially inside its volume. This information
is used to efficiently guarantee the correctness of
the face removal technique. The octree building process
can be decomposed into: (i) root octant building, (ii) octree
building, (iii) level adjustment, (iv) assignment of
partially connected mesh faces to terminal octants, and
(v) terminal octant classification.

The root octant is such that the given surface mesh is
contained within it. It is cubic in order to avoid the
creation of unnecessarily stretched tetrahedra coming from
the application of meshing templates on stretched octants
(assuming isotropy is desirable in the resulting mesh).
The terminal octants are constructed to be approximately
the same size as any partially connected mesh face associated
with them in order to ensure appropriate element
sizes and gradations. This is done by visiting each mesh
vertex in the initial surface mesh, computing the average
size of the connected mesh edges, and refining the octree
until any terminal octant around that vertex is at a
level corresponding to that average size. The level of the
octant is given by:

$$level = \log_2\left(\frac{rootlength}{size}\right) \qquad (16)$$

where  rootlength  is  the  length  of  the  root  octant  and 
size  is  the  size  of  the  mesh  entity  (defined  here  as  the 
average  length  of  the  bounding  edges).  It  should  be  noted 
that  this  procedure  does  not  theoretically  ensure  a  match 
in  size  between  every  terminal  octant  and  the  partially 
connected  mesh  faces  it  knows  about. 
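
The relation between octant level and entity size can be illustrated numerically. The sketch assumes that the level is the base-2 logarithm of rootlength divided by size and that it is rounded to the nearest integer; the rounding rule and both function names are assumptions of this sketch, not statements from the text.

```python
import math


def octant_level(rootlength, size):
    """Level at which an octant edge is about `size` long.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return max(0, round(math.log2(rootlength / size)))


def octant_length(rootlength, level):
    """Edge length of an octant after `level` successive halvings."""
    return rootlength / 2 ** level


level = octant_level(16.0, 1.9)
length = octant_length(16.0, level)
```

For a root octant of length 16 and a mesh entity of size 1.9, the sketch gives level 3 and an octant length of 2.0. The octant is close to, but not exactly, the entity size, which is consistent with the remark that an exact match is not ensured.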

To ensure a smooth gradation between octant levels, no
more than one level of difference is allowed between
terminal octants that share an octant edge. Application
of this rule can possibly lead to refinement of some
terminal octants past the level that was set by the partially
connected mesh faces in their volumes.

Once the tree is completed, partially connected mesh
faces are assigned to terminal octants. Given a mesh
face, terminal octants that should know about it can be
separated into two groups: (i) those that are in the path of
each bounding mesh edge (obtained by intersecting line
segments with axis-aligned solid boxes) and (ii) those
whose octant edges are in the path of the mesh face
(obtained by intersecting line segments with triangles).
Any terminal octant which knows at least one partially
connected mesh face is classified boundary. Terminal
octants classified boundary separate interior terminal octants
from outside terminal octants. At this point, it
should be noted that the interior of the model can be
made of several model regions. One octant corner of a
boundary terminal octant is then classified either interior
or outside by firing a ray toward a corner of the root octant.
Considering the partially connected mesh face closest
to the octant corner among the ones that intersect the ray,
the classification corresponding to the model region on the
side of the mesh face facing the octant corner is given to
the octant corner [76]. If there is no intersection, the octant
corner is classified outside. In case the intersection
is on the boundary of the partially connected mesh face,
no decision can be taken and a ray to another corner of
the root is fired. The classification of the octant corner
is then propagated to any neighboring terminal octant (in
a recursive way) which has not been classified yet. The
process of classifying an octant corner and propagating
its classification continues until all terminal octants have
been classified.
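
The segment-triangle test behind the ray firing can be sketched with the standard Möller-Trumbore formulation. This is a swapped-in textbook technique; the source does not say which intersection routine is used. The `on_boundary` flag mirrors the case where no decision can be taken, and the tolerance `eps` is an assumed value.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def segment_hits_triangle(p, q, tri, eps=1e-12):
    """Return (t, on_boundary) if segment p->q meets tri, else None.

    t in [0, 1] locates the hit along the segment; on_boundary is True
    when the hit lies on an edge or vertex of the triangle.
    """
    v0, v1, v2 = tri
    d = sub(q, p)
    e1, e2 = sub(v1, v0), sub(v2, v0)
    h = cross(d, e2)
    a = dot(e1, h)
    if abs(a) < eps:
        return None                      # segment parallel to the plane
    s = sub(p, v0)
    u = dot(s, h) / a                    # first barycentric coordinate
    qv = cross(s, e1)
    v = dot(d, qv) / a                   # second barycentric coordinate
    t = dot(e2, qv) / a
    if u < -eps or v < -eps or u + v > 1 + eps or t < 0 or t > 1:
        return None
    return t, min(u, v, 1 - u - v) <= eps


tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
inside = segment_hits_triangle((0.25, 0.25, -1.0), (0.25, 0.25, 1.0), tri)
on_edge = segment_hits_triangle((0.5, 0.0, -1.0), (0.5, 0.0, 1.0), tri)
miss = segment_hits_triangle((2.0, 2.0, -1.0), (2.0, 2.0, 1.0), tri)
```

Among all faces hit by a ray, the one with the smallest `t` is the closest to the octant corner; a hit flagged `on_boundary` would trigger firing a ray toward another corner of the root octant.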

After the basic octant classification process, interior terminal
octants can exist which have boundary entities arbitrarily
close to surface triangles in boundary octants.
Since poorly shaped elements can result when these entities
are too close, some interior terminal octants are
reclassified as boundary. If an interior terminal octant is
too close to a partially connected mesh face, it is reclassified
boundary. In this discussion, distances between two
entities are always considered relative, that is, the actual
distance should be divided by the average size of the entities
involved. In this particular case, the relative distance
between a partially connected mesh face and an octant
is equal to the absolute distance divided by the average
size of the octant (its length) and mesh face. The threshold
for closeness is set to 1.0, which basically guarantees
that there is at least a one-element buffer between interior
terminal octants and surface triangles.
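
The relative distance test can be written in a few lines. The function names are hypothetical, and treating "too close" as strictly below the threshold is an assumption of this sketch.

```python
def relative_distance(abs_distance, octant_length, face_size):
    """Absolute distance divided by the average size of the two entities."""
    return abs_distance / (0.5 * (octant_length + face_size))


def reclassify_as_boundary(abs_distance, octant_length, face_size,
                           threshold=1.0):
    """True if an interior octant is too close to a front face."""
    return relative_distance(abs_distance, octant_length, face_size) < threshold


far_enough = reclassify_as_boundary(1.5, 2.0, 1.0)   # relative distance 1.0
too_close = reclassify_as_boundary(1.0, 2.0, 1.0)    # relative distance 2/3
```

With an octant of length 2.0 and a face of size 1.0, the average size is 1.5, so an absolute distance of 1.5 sits exactly at the threshold and the octant stays interior, while a distance of 1.0 leads to reclassification.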

3.3.2  Template  Meshing  of  Interior  Octants 

Terminal octants classified interior are meshed using (i)
meshing templates or (ii) fast meshing procedures when
a template is not available. Examination of the number
of templates required for all cases and the distribution
of template usage indicates that octants with eight, nine,
thirteen, and seventeen vertices cover over 90% of the
interior octants. All the eight, nine, thirteen, and seventeen
vertex octant configurations can be meshed by six
templates (Fig. 24) with the correct rotations applied.
The remaining interior octants are then quickly meshed
using a fast procedure which accounts for the fact that
the octant is a rectangular prism. One very fast option
is to create an interior vertex and to create the correct
connections to it [94].
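
The interior-vertex option can be sketched for a plain cube. The surface triangulation, the use of the centroid as the interior vertex, and all names are assumptions for illustration; in the mesh generator the surface triangles of an octant would come from its neighbors, and a plain eight-vertex octant would normally use a template instead.

```python
def tet_volume(a, b, c, d):
    """Unsigned volume of the tetrahedron abcd."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    ad = [d[i] - a[i] for i in range(3)]
    det = (ab[0] * (ac[1] * ad[2] - ac[2] * ad[1])
           - ab[1] * (ac[0] * ad[2] - ac[2] * ad[0])
           + ab[2] * (ac[0] * ad[1] - ac[1] * ad[0]))
    return abs(det) / 6.0


def mesh_with_interior_vertex(vertices, surface_triangles):
    """Connect every surface triangle to a new vertex at the centroid."""
    n = len(vertices)
    center = tuple(sum(v[i] for v in vertices) / n for i in range(3))
    points = list(vertices) + [center]
    tets = [(a, b, c, n) for (a, b, c) in surface_triangles]
    return points, tets


# Unit cube: vertex k has coordinates given by the bits of k.
cube = [(k & 1, (k >> 1) & 1, (k >> 2) & 1) for k in range(8)]
quads = [(0, 1, 3, 2), (4, 5, 7, 6), (0, 1, 5, 4),
         (2, 3, 7, 6), (0, 2, 6, 4), (1, 3, 7, 5)]
tris = [t for (a, b, c, d) in quads for t in ((a, b, c), (a, c, d))]
points, tets = mesh_with_interior_vertex(cube, tris)
total = sum(tet_volume(*(points[i] for i in t)) for t in tets)
```

Because a rectangular prism is convex, its centroid sees every surface triangle, so the twelve tetrahedra fill the cube without overlap and their volumes sum to the cube volume.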


Figure  24.  Terminal  octant  meshing  templates  available: 
one  eight  vertex  case,  two  nine  vertex  cases,  one 
thirteen  vertex  case,  and  two  seventeen  vertex  cases 

3.3.3  Face  Removal 

Given  a  partially  connected  mesh  face,  a  face  removal 
consists  of  connecting  it  to  a  mesh  vertex.  Since  the 
volume  to  be  meshed  consists  of  the  space  between  the 
given  surface  triangulation  and  the  interior  octree,  the 
vertex  used  is  usually  an  existing  one.  However,  in  some 
situations,  it  is  desirable  to  create  a  new  vertex.  The 
choice  of  the  target  vertex  (existing  or  new)  must  be  such 
that  the  created  element  is  of  good  quality  and  its  creation 
does  not  lead  to  poor  (in  terms  of  shape)  subsequent  face 
removals  in  that  neighborhood. 

The following pseudo-code indicates how the target vertex
is selected for a given partially connected mesh face
to be removed. Detailed explanations of the key steps
are given in the following paragraphs of this section. In this
pseudo-code and any other thereafter, break forces an
exit from a loop, return forces an exit from the function
or routine (in other words, the function terminates), and
text between /* and */ denotes a comment [40].

1. Collect set of potential target vertices from tree
   neighborhood

2. Reorder target vertices with respect to decreasing
   shape measure (for the element to be created)

3. Initialize:
   a. dist_lim = a
   b. target_vert = 0
   c. max_min_dist = 0.0

4. for each potential target vertex vert {
   a. Perform preliminary check on acceptability. If
      not acceptable, continue
   b. If the new element contains any mesh vertex
      belonging to the front, continue
   c. If the new element intersects any existing mesh
      entity, continue
   d. Evaluate how close the new element is to existing
      mesh entities (compute relative minimum
      distance min_dist)
   e. if ( min_dist > dist_lim ) {
      • target_vert = vert
      • max_min_dist = min_dist
      • break
      }
   f. else if ( min_dist > max_min_dist ) {
      • target_vert = vert
      • max_min_dist = min_dist
      }
   }

5. if ( max_min_dist > dist_lim ) return

6. if ( target_vert == 0 ) {
   a. Create a new vertex vert at the best position for
      the partially connected mesh face to be removed
   b. target_vert = vert
   }

7. else { /* Consider creating a new vertex */
   a. Create a new vertex vert at the best position for
      the partially connected mesh face to be removed
   b. Evaluate closeness of new element to existing
      mesh entities (min_dist)
   c. if ( min_dist > max_min_dist ) target_vert =
      vert /* Better to create a new vertex */
   }
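
The control flow of steps 3 to 7 can be sketched in Python. The three callables are hypothetical stand-ins for the geometric checks (4.a to 4.c), the closeness evaluation (4.d), and the creation of a new vertex; they are not part of the original procedure's interface.

```python
def select_target(candidates, dist_lim, is_acceptable, min_dist_of, new_vertex):
    """Follow steps 3-7 of the pseudo-code.

    candidates:    potential target vertices, best shape measure first
    is_acceptable: callable(vert) -> bool, stands in for checks 4.a-4.c
    min_dist_of:   callable(vert) -> relative minimum distance (step 4.d)
    new_vertex:    callable() -> vertex created at the best position
    """
    target, max_min_dist = None, 0.0
    for vert in candidates:
        if not is_acceptable(vert):
            continue
        min_dist = min_dist_of(vert)
        if min_dist > dist_lim:
            return vert                      # steps 4.e and 5
        if min_dist > max_min_dist:          # step 4.f
            target, max_min_dist = vert, min_dist
    vert = new_vertex()
    if target is None:                       # step 6
        return vert
    if min_dist_of(vert) > max_min_dist:     # step 7
        return vert
    return target


dist = {"a": 0.2, "b": 0.4, "c": 0.9, "new": 0.3}
ok = lambda v: True
first_good = select_target(["a", "b", "c"], 0.5, ok, dist.get, lambda: "new")
best_seen = select_target(["a", "b", "c"], 1.0, ok, dist.get, lambda: "new")
created = select_target(["a"], 0.5, ok, dist.get, lambda: "new")
```

Because the candidates are ordered by decreasing shape measure, the first one that clears `dist_lim` is accepted immediately; otherwise the candidate with the largest minimum distance competes against a newly created vertex.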

The neighborhood of an entity is defined as a tree neighborhood
of a given order. Given a mesh entity, a tree
neighborhood of order 0 consists of all terminal octants
that know about the entity (have the entity or part of
it within their volumes). A tree neighborhood of order
n (n > 0) consists of a tree neighborhood of order n-1
to which are added all terminal octants that neighbor any
octant corner of any terminal octant in the tree neighborhood
of order n-1. The set of potential target vertices
is obtained via the partially connected mesh faces in the
tree neighborhood of the appropriate order for the face under
consideration. The set of potential target vertices should
be as small as possible (for efficiency reasons) but should
not be missing the best target (with respect to both shape
of new element and closeness to nearby existing mesh entities)
that would be found if all mesh vertices of the front were
considered. A tree neighborhood of order 0 is clearly not
enough, while a tree neighborhood of order 1 is adequate
when the terminal octants have approximately the same
sizes as the partially connected mesh faces they know
about.
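
The recursive definition of the tree neighborhood amounts to growing a set one ring at a time. In this sketch the corner adjacency is supplied as a precomputed dictionary, which is an assumption; in the octree it would be obtained by neighbor finding.

```python
def tree_neighborhood(entity_octants, corner_neighbors, order):
    """Terminal octants in the tree neighborhood of the given order.

    entity_octants:   octants that know about the entity (order 0)
    corner_neighbors: dict octant -> octants sharing one of its corners
    """
    hood = set(entity_octants)
    for _ in range(order):
        ring = set()
        for octant in hood:
            ring.update(corner_neighbors.get(octant, ()))
        hood |= ring
    return hood


# Five octants in a row; each touches its left and right neighbor.
adjacent = {i: {j for j in (i - 1, i + 1) if 0 <= j < 5} for i in range(5)}
order0 = tree_neighborhood({2}, adjacent, 0)
order1 = tree_neighborhood({2}, adjacent, 1)
order2 = tree_neighborhood({2}, adjacent, 2)
```

Each increase in order adds one layer of octants, so the number of candidate target vertices grows quickly with the order; this is why the smallest adequate order is preferred.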

For efficiency, it is desirable to discard potential target
vertices as early as possible.
A potential target is kept only if it satisfies one of the
three following conditions (types):

1.  connects  to  a  bounding  vertex  of  the  face  to  be 
removed  through  a  mesh  edge  of  the  front.  This 
allows  for  the  removal  of  partially  connected  mesh 
faces  other  than  the  face  in  consideration  (not  in  all 
cases)  and  therefore  leads  to  a  reduction  of  the  size  of 
the  front  (guaranteeing  convergence  of  the  method) 

2.  is  positioned  inside  the  sphere  centered  at  the  best 
position  (with  respect  to  shape)  for  the  fourth  vertex 
of  the  face  to  be  removed  with  a  radius  the  size  of 
the  face  to  be  removed.  This  avoids  the  creation 
of  a  stretched  element  with  respect  to  the  face  in 
consideration. 

3.  any  of  the  three  bounding  vertices  of  the  face  to 
be  removed  are  positioned  inside  the  sphere  of  any 
of  the  partially  connected  mesh  faces  connected  to 
the  target  vertex.  This  allows  for  the  creation  of  a 
stretched  element  with  respect  to  the  face  in  consid¬ 
eration  which  is  not  stretched  with  respect  to  par¬ 
tially  connected  mesh  faces  connected  to  the  target. 

Figure  25  shows  potential  target  vertices  of  type  1,  2, 
and  3  for  the  face  to  remove. 




Figure  25.  The  three  types  of 
potential  target  vertices  (2-d  setting) 


Given  a  potential  target  vertex,  one  has  to  make  sure  that 
any  new  mesh  entity  (resulting  from  the  creation  of  the 
new  mesh  region)  does  not  intersect  an  existing  mesh 
entity.  The  creation  of  a  new  mesh  region  may  result 
in  the  creation  of  a  new  mesh  vertex,  up  to  three  new 
mesh  edges,  and  up  to  three  new  mesh  faces.  New  mesh 
edges  are  checked  for  intersection  against  nearby  partially 
connected  mesh  faces.  Given  a  virtual  new  mesh  edge, 
the  nearby  partially  connected  mesh  faces  are  obtained 
through  the  tree  neighborhood  of  order  0  (of  the  new 
edge).  If  no  intersection  is  detected,  new  mesh  faces  are 
checked  for  intersection  against  nearby  front  mesh  edges. 
Given a virtual new mesh face, nearby front mesh edges
are obtained through the partially connected mesh faces
in the tree neighborhood of order 0 (of the new face).
Because  any  terminal  octant  knows  about  the  partially 
connected  mesh  faces  in  its  volume,  considering  a  tree 
neighborhood  of  order  0  guarantees  that  no  intersection 
can  be  missed. 

The  closeness  of  the  new  mesh  region  to  existing  mesh 
entities  is  evaluated  by  considering  the  minimum  relative 
distance  between  any  new  mesh  entity  and  nearby  exist¬ 
ing  mesh  entities.  The  relative  distance  is  defined  as  the 
absolute  distance  divided  by  the  average  size  of  the  mesh 
entities  involved.  The  nearby  mesh  entities  are  obtained 
through  a  tree  neighborhood  of  order  1  of  the  new  entity 
being  tested.  It  is  important  to  note  that  nearby  existing 
mesh  entities  in  a  tree  neighborhood  of  order  1  may  not 
be  in  a  tree  neighborhood  of  order  0.  On  the  other  hand, 
nearby  existing  mesh  entities  cannot  be  missed  with  a 
tree  neighborhood  of  order  1.  If  there  is  a  new  vertex, 
distances  between  the  new  vertex  and  nearby  existing  par¬ 
tially  connected  mesh  faces  are  considered.  For  any  new 
mesh  edge,  distances  between  the  new  edge  and  nearby 
existing  front  mesh  edges  are  considered.  If  the  point  on 
the  new  edge  corresponding  to  the  distance  (that  is,  clos¬ 
est  to  the  nearby  existing  front  mesh  edge)  corresponds 
to  an  existing  bounding  mesh  vertex,  the  distance  is  dis¬ 
carded.  In  that  case,  it  means  that  the  nearby  existing 
front  mesh  edge  is  close  to  another  existing  mesh  entity 
and  not  to  a  new  mesh  entity.  Also,  for  any  new  face, 
distances  between  the  new  face  and  nearby  existing  front 
mesh  vertices  are  considered.  Again,  distances  are  dis¬ 
carded  if  the  point  on  the  new  mesh  face  corresponding  to 
the  distance  (that  is,  closest  to  the  nearby  existing  front 
mesh  vertex)  is  actually  on  an  existing  bounding  mesh 
vertex  or  edge.  The  three  different  cases  are  shown  in 
Figure  26.  The  threshold  a  corresponds  to  what  is  con¬ 
sidered  acceptable  in  terms  of  closeness  when  creating  a 
new  element.  Experimentation  led  to  the  use  of  a  value 
of  0.2  for  a. 
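
The closeness test can be sketched as below. The text only states that the relative distance is the absolute distance divided by the average size of the entities involved and that the threshold a is 0.2; the use of an arithmetic mean of two sizes, and the rule that a candidate is rejected when its minimum relative distance falls below the threshold, are assumptions of this sketch.

```python
from typing import Iterable, Tuple

THRESHOLD_A = 0.2  # value reported in the text for the threshold a


def relative_distance(absolute_distance: float,
                      size_new: float,
                      size_existing: float) -> float:
    """Absolute distance divided by the average size of the two entities.

    Assumption: 'average size' is taken as the arithmetic mean of the
    sizes of the new entity and of the nearby existing entity.
    """
    average_size = 0.5 * (size_new + size_existing)
    if average_size <= 0.0:
        raise ValueError("entity sizes must be positive")
    return absolute_distance / average_size


def is_too_close(measurements: Iterable[Tuple[float, float, float]],
                 threshold: float = THRESHOLD_A) -> bool:
    """measurements: (absolute_distance, size_new, size_existing) triples.

    Assumption: the new element is rejected when the minimum relative
    distance over all retained measurements is below the threshold.
    """
    relative = [relative_distance(*m) for m in measurements]
    return bool(relative) and min(relative) < threshold
```

Distances that the text says must be discarded (closest point lying on an existing bounding vertex or edge) are assumed to have been filtered out before `is_too_close` is called.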


[Figure 26 panels: new vertex vs. nearby faces; new edge vs. nearby edges; new face vs. nearby vertices]


Figure  26.  Evaluation  of  relative  minimum  distance 
between  new  entities  and  nearby  existing  mesh  entities 


If  a  new  vertex  needs  to  be  created,  its  location  must 
be  such  that  the  new  element  is  well-shaped,  and  neither 
causes  intersection  nor  is  too  close  to  nearby  existing 
mesh  entities.  The  initial  location  for  the  new  vertex  is  at 
the  position  which  creates  the  best  shaped  element  for  the 
face to be removed. This location is on the perpendicular
to the face passing through its centroid. If the current
location causes the new element to intersect nearby existing
mesh entities, a new location is considered on the normal,
half-way between the face and the current location, and so on,
until a valid location is found. In order not to be too close
to existing mesh entities, the final location is then taken
conservatively half-way between the face and the valid
location found.
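
The halving search for the new vertex location can be sketched as follows. The callback `is_valid` (true when the candidate location causes no intersection), the parameter `best_height` (distance from the face centroid to the best-shape location), the iteration cap, and the choice to apply the final conservative halving only when the initial location was rejected are all assumptions of this sketch, not details given in the text.

```python
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def place_new_vertex(centroid: Sequence[float],
                     unit_normal: Sequence[float],
                     best_height: float,
                     is_valid: Callable[[Point], bool],
                     max_halvings: int = 20) -> Optional[Point]:
    """Search along the face normal for a valid new-vertex location."""

    def point_at(height: float) -> Point:
        return tuple(c + height * n for c, n in zip(centroid, unit_normal))

    height = best_height
    if is_valid(point_at(height)):
        # Best-shape location is usable as is.
        return point_at(height)
    for _ in range(max_halvings):
        height *= 0.5  # move half-way toward the face
        if is_valid(point_at(height)):
            # Conservative step: halve once more to stay away from
            # nearby existing mesh entities.
            return point_at(0.5 * height)
    return None  # no valid location found within the iteration cap
```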

Figure  27  graphically  depicts  a  face  removal  in  a  two- 
dimensional  setting.  There  are  four  target  vertices  ordered 
(1,  2,  3,  and  4)  with  respect  to  increasing  shape  measure 
of  the  element  to  be  created.  Target  vertex  1  is  rejected 
since  the  new  element  is  too  close  to  an  existing  mesh 
entity  (vertex  3).  Target  vertex  2  is  rejected  since  the  new 
element  intersects  existing  mesh  entities.  Target  vertex  3 
is  therefore  accepted. 




Figure  27.  Potential  target  vertices 
and  best  face  removal  (2-d  setting) 


3.4.  Parallel  Constructs  Required 

3.4.1  Octree  and  Mesh  Data  Structures 

The two main data structures are the mesh and octree data
structures. The mesh data structure (sequential) and the
parallel mesh database (PMDB), both described above, are
used here to support the presented mesh generator. The
octree data structure is built on top of the mesh data structure.
To  gather  a  tree  neighborhood  or  all  terminal  octants  in 
the  path  of  a  mesh  entity  (new  vertex,  edge,  or  face), 
any  processor  must  be  able  to  effectively  determine  to 
which  processor  any  given  terminal  octant  is  assigned. 
This  information  is  easily  available  when  each  proces¬ 
sor  has  full  knowledge  of  the  basic  octree  in  terms  of 
structure  and  processor  assignment.  This  is  the  approach 
currently  implemented.  Although  the  size  of  the  tree  is 
small  compared  to  that  of  the  mesh  and  this  tree  informa¬ 
tion  can  easily  be  copied  to  each  processor,  this  approach 
does  not  scale  indefinitely.  Any  terminal  octant  stores 
links  to  on-processor  partially  connected  mesh  faces  and 
off-processor  partially  connected  mesh  faces  totally  or 
partially  within  its  volume.  Octree  neighboring  informa¬ 
tion  (like  finding  terminal  octants  neighboring  an  octant 
face,  edge,  or  corner)  is  obtained  through  tree  traversals 
(logarithmic  complexity). 

Techniques that maintain only portions of the tree on individual
processors while providing tree neighboring information
efficiently are currently under investigation. It is of interest
to be able to retrieve tree neighboring information without
having to communicate. If communication is allowed during a
neighboring information request, some processors will have to
interrupt their work to take part in the request, which can
certainly degrade the overall performance if not done carefully.
An easy solution is to make sure that all processors participate
in the request (soft synchronization). On a sequential machine,
the tree traversals needed to obtain neighboring information
(typically, getting all terminal octants that neighbor an octant
face, edge, or corner) can be avoided if the terminal octants
neighboring each octant face are stored. The limited increase in
data storage is well worth the constant time complexity for
getting neighboring information. In a parallel setting, it is
difficult to conceive such a scheme without having to
communicate between processors.

3.4.2  Multiple  Octant  Migration 

When the mesh generation process reaches a point where
no face removal can be applied (face removals are not
applied when the needed tree neighborhoods are not fully
on-processor), the tree and associated mesh are repartitioned.
The migration of octants is key to repartitioning
once  decisions  concerning  new  destinations  of  terminal 
octants  (classified  boundary)  have  been  made.  Multiple 
octant  migration  itself  relies  on  the  multiple  migration  of 
partially  connected  mesh  faces  and/or  mesh  regions  (de¬ 
scribed  above).  Note  that  multiple  mesh  region  migration 
is  also  used  in  the  final  repartitioning  at  the  region  level 
once  the  mesh  has  been  fully  generated. 

Any  processor  can  send  any  number  of  terminal  octants 
to  another  processor.  When  a  terminal  octant  is  migrated 
from  one  processor  to  another,  the  partially  connected 
mesh  faces  not  connected  to  any  mesh  region  (these  are 
the  mesh  faces  remaining  from  the  given  surface  triangu¬ 
lation)  owned  by  the  octant  and/or  the  mesh  regions  that 
are  bounded  by  at  least  one  partially  connected  mesh  face 
owned  by  the  octant  are  migrated  as  well.  An  octant  owns 
a  mesh  entity  when  it  knows  about  it  (has  it  within  its 
volume)  and  has  its  centroid  within  its  volume.  Note  that 
a  partially  connected  mesh  face  not  known  by  the  octant 
may  be  migrated  as  part  of  a  mesh  region  if  that  region  is 
bounded  by  another  partially  connected  mesh  face  whose 
owner  is  the  octant.  Also,  if  a  mesh  region  is  bounded 
by  more  than  one  partially  connected  mesh  face  known 
to  the  octant  to  be  migrated  (up  to  four),  the  ownership  is 
arbitrarily  dictated  by  the  first  partially  connected  mesh 
face  to  be  processed  (from  the  list  of  partially  connected 
mesh  faces  known  to  the  octant).  Figure  28  shows  a  two- 
dimensional  example  of  the  mesh  regions  to  be  migrated 
within  an  octant.  When  the  multiple  octant  migration 
completes,  the  processor  is  informed  of  the  octants  it  has 
received.  For  each  received  octant,  a  list  of  associated 
mesh  entities  is  also  given,  basically  the  partially  con¬ 
nected  mesh  faces  and/or  mesh  regions  that  were  sent. 
The  primary  complexity  that  arises  when  migrating  oc¬ 
tants  and  associated  mesh  information  is  the  absence  of 
a  global  labeling  system  for  the  mesh  entities.  Each  pro¬ 
cessor  employs  a  local  labeling  for  the  hierarchy  of  mesh 
entities  that  it  is  assigned.  The  interprocessor  mesh  adja¬ 
cency  links  maintain  the  required  knowledge  of  the  adja¬ 
cent  mesh  entities  on  neighboring  processors.  Although 
the mesh data for a partially connected face is on one
processor, the octants which refer to that face may be on
multiple processors. Since the face removal procedure
must perform geometric checks on all partially connected
faces known to an octant, the time required to perform
these operations would be greatly increased if the required
information had to be fetched from neighboring processors.
To eliminate this requirement, each partially connected
face known to an octant is stored either as a pointer to
the face, when the face is actually on-processor, or as a set
of three coordinates when the face is stored off-processor.
Although  this  approach  avoids  interprocessor  communi¬ 
cations,  it  complicates  the  process  of  updating  references 
to  partially  connected  mesh  faces  on  and  off-processor 
when  octants  are  migrated.  Concerning  the  update  of 
processor  assignment  at  the  octant  level,  since  the  tree 
structure  is  currently  stored  on  all  processors,  a  broad¬ 
cast  is  performed  to  all  processors  indicating  the  fact  that 
octants  have  been  relocated. 
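
The ownership rule stated above (an octant owns a mesh entity when it knows about it and the entity centroid lies within the octant volume) can be sketched as below. The axis-aligned box representation of an octant and the half-open interval test, chosen so that a centroid lying on a shared octant boundary has a single owner, are assumptions of this sketch.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def centroid(vertices: Sequence[Sequence[float]]) -> Point:
    """Arithmetic mean of the vertex coordinates of a mesh entity."""
    count = len(vertices)
    return tuple(sum(v[k] for v in vertices) / count for k in range(3))


def octant_owns(octant_min: Sequence[float],
                octant_max: Sequence[float],
                entity_vertices: Sequence[Sequence[float]],
                knows_about: bool = True) -> bool:
    """True when the octant knows about the entity and contains its centroid.

    Assumption: the octant is the half-open box [octant_min, octant_max).
    """
    if not knows_about:
        return False
    c = centroid(entity_vertices)
    return all(lo <= x < hi for lo, x, hi in zip(octant_min, c, octant_max))
```

For example, a face with vertices (0, 0, 0), (0.9, 0, 0), and (0, 0.9, 0) has its centroid at (0.3, 0.3, 0) and would be owned by the octant spanning [0, 1) in each direction.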

3.4.3  Dynamic  Repartitioning 

Dynamic  repartitioning  enables  redistribution  of  the  load 
among  processors  as  evenly  as  possible  at  key  stages  of 
the  mesh  generation  process.  These  key  stages  are: 

1.  at  the  beginning  of  template  meshing, 

2.  at  the  beginning  of  each  face  removal  step,  and 

3.  at  completion  of  the  mesh  generation  process. 

Repartitioning  for  stages  1  and  2  is  done  at  the  terminal 
octant  level  (1  with  respect  to  terminal  octants  classified 
interior  and  2  with  respect  to  terminal  octants  classified 
boundary).  Repartitioning  for  stage  3  is  performed  at 
the mesh region level. The strategy is identical for
both cases; only the process of migrating differs. The
methods  used  here  are  geometry-based  dynamic  balancing 
(repartitioning)  procedures  which  are  described  in  section 
2.3.1. 

3.5.  Parallel  Region  Meshing 

3.5.1  Underlying  Octree

At this point in time, the octree is built sequentially on
a single processor (processor 0). Since sequential octree
building can become a bottleneck when dealing with very
large meshes, techniques to build the tree in parallel are
currently being considered. A distributed tree can be
constructed in parallel as long as operators to subdivide
and migrate octants are available. Octant migration guarantees
that the tree can be well distributed at any stage
during the building process, which is important for memory
usage. Those operators are key to the problem since they
update possible inter-processor links in a distributed tree.

3.5.2  Template  Meshing  of  Interior  Octants 

Once  all  terminal  octants  have  been  properly  classified, 
the  terminal  octants  classified  interior  are  partitioned. 
The parallel application of templates is a straightforward
process in which no communication is required
during the creation of the octant-level meshes.
It should be noted that the application of templates to
octants sharing the same octant face implicitly leads to
the same octant face triangulation. The finite elements
generated  in  these  octants  are  loaded  into  the  processor 
mesh  data  structure.  The  interprocessor  communication 
required  at  the  end  of  this  step  is  for  the  updating  of  inter¬ 
processor  mesh  entity  links  for  mesh  entities  created  on 
the  boundaries  of  interior  octants  which  are  on  processor 
boundaries.  The  cost  for  the  application  of  templates  is 
small  compared  to  the  cost  of  performing  face  removals. 
Therefore,  the  parallel  efficiency  of  parallel  region  mesh¬ 
ing  is  dictated  primarily  by  the  face  removal  part  only. 

3.5.3  Face  Removal 

Parallel  face  removal  is  an  iterative  process  where  each 
iteration  consists  of  three  steps: 

1.  Tree  repartitioning  at  the  terminal  octant  (classified 
boundary)  level, 

2.  Face  removal  step,  and 

3.  Reclassification of terminal octants from boundary
to meaningless.

The  goal  of  step  1  is  to  make  sure  that  all  processors  will 
have  an  equal  amount  of  work  to  perform  during  step  2.  It 
is  difficult  to  predict  how  much  work  or,  more  precisely, 
how  many  face  removals  (step  2)  any  processor  will  per¬ 
form  and  the  total  amount  of  effort  for  a  particular  face 
removal.  However,  a  terminal  octant  classified  boundary 
is  a  good  unit  of  work  load  since  the  set  of  all  terminal 
octants  classified  boundary  approximately  corresponds  to 
the  domain  still  to  be  meshed.  The  difficulty  of  perform¬ 
ing  face  removals  in  parallel  resides  in  the  fact  that  any 
face  removal  requires  the  knowledge  of  tree  neighbor¬ 
hoods.  Tree  neighborhoods  of  order  0  or  1  are  needed  at 
different  steps  of  the  removal  of  a  given  mesh  face.  If,  at 
any  point  during  the  face  removal,  a  tree  neighborhood 
is  not  fully  on-processor,  the  face  removal  is  aborted  and 
the next mesh face is considered for removal. Once all
possible face removals have been performed on-processor,
some terminal octants classified boundary which used
to know about partially connected mesh faces (on- or
off-processor) are reclassified meaningless. Because those
octants no longer cover any portion of the domain still
to be meshed, they are now useless (for the purpose of
face removals) and will therefore not influence the next
repartitioning.

Figure  29  depicts  the  first  iteration  on  a  simplistic  exam¬ 
ple.  In  the  left-side  picture,  terminal  octants  classified 
boundary  have  been  partitioned  and  each  of  them  is  as¬ 
signed  to  a  processor  (0  to  3).  The  right-hand  side  picture 
shows  the  current  mesh  after  all  possible  face  removals 
have  been  performed  on  processors.  Shaded  areas  repre¬ 
sent  the  domain  still  to  be  meshed. 


Figure  29.  Parallel  face  removal  (2-d  setting) 


The  process  of  performing  face  removals  and  repartition¬ 
ing  the  tree  continues  until  there  are  no  more  partially 
connected  mesh  faces  in  the  mesh.  Define  the  efficiency 
of  the  face  removal  stage  as  being  the  ratio  of  the  number 
of  performed  face  removals  to  the  number  of  attempted 
face  removals.  After  a  few  iterations,  the  efficiency  of 
the  face  removal  stage  can  be  very  low  because  informa¬ 
tion  required  to  perform  face  removals  is  almost  always 
off-processor.  When  more  than  half  of  the  processors 
have  an  efficiency  below  some  given  threshold  (25%), 
the  processor  set  is  reduced  (by  half). 
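
The processor-reduction rule can be sketched as follows. The 25% threshold and the halving of the processor set come from the text; treating a processor with no attempted face removal as having zero efficiency, and never going below one processor, are assumptions of this sketch.

```python
from typing import Sequence, Tuple


def removal_efficiency(performed: int, attempted: int) -> float:
    """Ratio of performed to attempted face removals on one processor.

    Assumption: a processor that attempted nothing has efficiency 0.
    """
    return performed / attempted if attempted else 0.0


def next_processor_count(stats: Sequence[Tuple[int, int]],
                         current_count: int,
                         threshold: float = 0.25) -> int:
    """stats: (performed, attempted) per active processor.

    The processor set is halved when more than half of the processors
    have an efficiency below the threshold.
    """
    low = sum(1 for performed, attempted in stats
              if removal_efficiency(performed, attempted) < threshold)
    if low > len(stats) / 2 and current_count > 1:
        return max(1, current_count // 2)
    return current_count
```

With four processors reporting (10, 100), (5, 100), (50, 100), and (1, 100), three of the four are below 25%, so the next iteration would run on two processors.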

Since  migration  of  terminal  octants  only  deals  with  those 
classified  boundary  and  only  worries  about  mesh  regions 
bounded  by  partially  connected  mesh  faces,  it  is  very 
likely  that  the  final  mesh  will  be  scattered  across  proces¬ 
sors  with  no  real  structure.  It  is  therefore  necessary  to 
repartition  in  parallel  the  distributed  mesh  using  IRB  at 
the  mesh  region  level  with  the  original  full  set  of  proces¬ 
sors.  Figure  30  shows  the  whole  process  of  parallel  face 
removal  on  four  processors.  The  first  8  pictures  display 
the  currently  partially  connected  mesh  faces  after  the  ter¬ 
minal  octants  classified  boundary  have  been  repartitioned. 
Note  that  iterations  1,  2,  3,  and  4  use  all  four  processors, 
iterations  5,  6,  and  7  use  two  processors,  and  iteration  8 
uses  one  processor.  The  final  picture  displays  the  final 
three-dimensional  repartitioned  mesh  on  four  processors. 
Tables 4 and 5 show speed-ups for up to four processors
for the connecting rod and blade models, respectively (final
repartitioned meshes on four processors are shown in
Figs. 31 and 32, respectively). Tables 6, 7, and 8
show speed-ups for up to eight processors for the ONERA
wing, mechanical part, and mechanical part 2 models,
respectively (final repartitioned meshes on eight processors
are shown in Figs. 33, 34, and 35, respectively).
The number of mesh regions created indicated in the captions
corresponds to parallel face removal only and does
not include template meshing. Face removal speed-up
indicates the speed-up for step 2 of the parallel face removal
procedure. Total speed-up indicates the speed-up for all steps
(1, 2, and 3). In that case, the first repartitioning
(iteration 1) is not counted since it can be considered an initial
partitioning step. Note that the time taken to perform the
first repartitioning depends on the size of the problem and
not on the number of processors. The speed-up is by
definition set to 1.0 for the run with the smallest number of
processors. The results show good speed-ups as long as
the size of the problem is adequate for the number of
processors on hand.

[Figure 30 panels: iterations 1 to 4 on 4/4 procs; iterations 5 to 7 on 2/4 procs; iteration 8 on 1/4 procs; final mesh]

Figure 30.  Successive face removal iterations
and final repartitioned mesh for chicklet

Procs                   1      2      4
Iterations              1      5      7
Face removal speedup    1.0    1.9    3.3
Total speedup           1.0    1.8    2.9

Table 4  Face removal statistics for
connecting rod (35,000 mesh regions
created by face removals, 70,000 total)

Figure 31.  Final repartitioned mesh
for connecting rod (4 processors)

Procs                   1      2      4
Iterations              1      5      8
Face removal speedup    1.0    2.0    3.2
Total speedup           1.0    1.9    2.8

Table 5  Face removal statistics for blade (60,000 mesh
regions created by face removals, 90,000 total)


Figure 33.  Final repartitioned mesh
for ONERA wing (8 processors)

4.  Parallel  Mesh  Enrichment 

4.1.  Local  Retriangulation  Tools 

Local retriangulation techniques have been used to transform
locally non-Delaunay triangulations of a set of
points into Delaunay triangulations [35], generate Geometric
triangulations of models with faceted boundaries
[28] (boundary recovery), optimize existing triangulations
[25, 17], etc. The local retriangulation tools presented in
this section do not delete or create vertices. The mesh
entity splitting presented in the refinement section creates
a vertex. The edge collapsing presented in the derefinement
section deletes a vertex. Local retriangulation tools
are used here to optimize triangulations (locally or globally)
and to help in "snapping" refinement mesh vertices
to the model boundary (if required).

Table 8  Face removal statistics for
mechanical part 2 (125,000 mesh regions
created by face removals, 240,000 total)

Figure 35.  Final repartitioned mesh
for mechanical part 2 (8 processors)

A few definitions related to triangulation quality relevant
to local retriangulation tools are now given:

Triangulation quality: If each mesh entity ${}_mT_i^{d}$ of a
triangulation $T$ is associated with a quality measure $q_i$,
the quality of the triangulation is defined as $Q = \min_i (q_i)$.

Triangulation acceptability: Given a quality threshold $q_t$,
a triangulation $T$ is acceptable with respect to
triangulation quality if $Q > q_t$.

Triangulation comparison: A triangulation $T_i$ of a set
of points is considered better with respect to triangulation
quality than another triangulation $T_j$ of the same set of
points if $Q_i > Q_j$.
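
The three definitions translate directly into code. The sketch below assumes that each triangulation is represented simply by the list of quality measures of its mesh entities; the function names are illustrative only.

```python
from typing import Iterable


def triangulation_quality(entity_qualities: Iterable[float]) -> float:
    """Quality of a triangulation: the worst quality among its entities."""
    return min(entity_qualities)


def is_acceptable(entity_qualities: Iterable[float], threshold: float) -> bool:
    """Acceptable when the triangulation quality exceeds the threshold."""
    return triangulation_quality(entity_qualities) > threshold


def is_better(qualities_a: Iterable[float], qualities_b: Iterable[float]) -> bool:
    """True when triangulation A has a strictly higher quality than B."""
    return triangulation_quality(qualities_a) > triangulation_quality(qualities_b)
```

Because the quality is a minimum, a single poorly shaped entity determines the quality of the whole triangulation, which is why the local tools below target individual undesirable mesh regions.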

4.1.1  Edge  Swapping 

In  two  and  three  dimensions,  a  swapping  step  is  per¬ 
formed  after  inserting  a  new  node  into  the  triangulation 
to  transform  a  locally  non-Delaunay  triangulation  into  a 
Delaunay  one.  Aside  from  the  refinement  issue,  it  is  a 
method  to  incrementally  build  a  Delaunay  triangulation 
of  a  set  of  points. 
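
In two dimensions, the swapping decision for an edge shared by two triangles can be sketched with the classical in-circle determinant. This is a minimal illustration using floating-point arithmetic without robustness safeguards; the function names are assumptions of the sketch.

```python
from typing import Tuple

Point2 = Tuple[float, float]


def orient2d(a: Point2, b: Point2, c: Point2) -> float:
    """Twice the signed area of triangle (a, b, c); positive if counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a: Point2, b: Point2, c: Point2, d: Point2) -> bool:
    """True if d lies strictly inside the circumcircle of the
    counter-clockwise triangle (a, b, c)."""
    rows = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    det = (a1 * (b2 * c3 - b3 * c2)
           - a2 * (b1 * c3 - b3 * c1)
           + a3 * (b1 * c2 - b2 * c1))
    return det > 0.0


def should_swap(p: Point2, q: Point2, r: Point2, s: Point2) -> bool:
    """Edge (p, q) is shared by triangles (p, q, r) and (p, q, s).

    The edge is swapped to (r, s) when the quadrilateral is strictly
    convex and the edge is not locally Delaunay.
    """
    # r and s must lie on opposite sides of (p, q), and p and q on
    # opposite sides of (r, s); otherwise the quadrilateral is not
    # strictly convex and the diagonals cannot be switched.
    if orient2d(p, q, r) * orient2d(p, q, s) >= 0.0:
        return False
    if orient2d(r, s, p) * orient2d(r, s, q) >= 0.0:
        return False
    if orient2d(p, q, r) < 0.0:
        p, q = q, p  # make (p, q, r) counter-clockwise
    return in_circumcircle(p, q, r, s)
```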

Swapping relies on the general result given by Lawson
which states that a set of $n+2$ points in $\mathbb{R}^n$ may be
triangulated in at most two ways [42]. In two dimensions,
there are two ways to triangulate a strictly convex
quadrilateral. Edge swapping consists of switching
diagonals for the quadrilateral resulting from the union
of the two connected triangles (if convex). In three
dimensions, there are two ways to triangulate a strictly
convex triangular hexahedron containing five and only
five points (the five apices of the triangular hexahedron).
Joe provided a set of workable swappable configurations
for the three-dimensional case [35]. If a mesh face ${}_mT_i^2$
is not locally optimal (does not satisfy the Delaunay
criterion) and corresponds to one of the two situations
on the left side of Figure 36, it is swapped. If
$[{}_mT_a^0, {}_mT_b^0] \cap {}_mT_i^2 \neq \emptyset$
($[{}_mT_a^0, {}_mT_b^0]$ being the line segment spanning from
${}_mT_a^0$ to ${}_mT_b^0$, the two vertices opposite the face), the
triangular hexahedron initially containing two tetrahedra is
retriangulated with three. If
$[{}_mT_a^0, {}_mT_b^0] \cap (S_1 \cup S_2 \cup S_3) \neq \emptyset$
(where the $S_i$'s are plane sectors appearing shaded in Figure 36)
and $\exists\, {}_mT_j^3 \mid \{{}_mT_a^0, {}_mT_b^0\} \in \partial({}_mT_j^3)$,
the triangular hexahedron initially containing three tetrahedra
is retriangulated with two.


Figure 36.  2-to-3 and 3-to-2 swaps in three dimensions


These swaps, commonly referred to as 2-to-3 and
3-to-2, are suited for Delaunay triangulations and by
extension for regular triangulations [19]. Referring
to Fig. 36, if
$[{}_mT_a^0, {}_mT_b^0] \cap (S_1 \cup S_2 \cup S_3) \neq \emptyset$
and $\nexists\, {}_mT_j^3 \mid \{{}_mT_a^0, {}_mT_b^0\} \in \partial({}_mT_j^3)$, or
$[{}_mT_a^0, {}_mT_b^0] \cap ({}_mT_i^2 \cup S_1 \cup S_2 \cup S_3) = \emptyset$,
there is no possible swap. When dealing with Delaunay
triangulations (or regular triangulations), theoretical
results indicate that non-swappable faces (in Joe's
sense) are not critical. However, when dealing with
any other criterion, non-swappable faces (in Joe's
sense) may be critical. The other non-swappable
configuration from Figure 36, which corresponds to
$[{}_mT_a^0, {}_mT_b^0] \cap ({}_mT_i^2 \cup S_1 \cup S_2 \cup S_3) = \emptyset$,
consists of four tetrahedra bounded by [...] respectively [35].
It is clear that there is no other way of triangulating this
convex hull. The ideas presented by Briere de l'Isle and George
[17] about edge removal enable the extension of the
classic 3-to-2 swap [35, 19].


4.1.2  Edge  Removal 

Briere de l'Isle and George [17] have proposed an edge
removal technique as part of an algorithm to optimize the
quality of a given mesh. It can also be used as part of a
scheme to recover the faceted boundary of a model [28].
A mesh edge ${}_mT_i^1$, classified in a model region and
bounded by vertices ${}_mT_a^0$ and ${}_mT_b^0$, can be eliminated
by retriangulating the polyhedron of all connected tetrahedra.
The polyhedron is retriangulated by: (i) triangulating the
polygon of all mesh vertices of the polyhedron which are
neither ${}_mT_a^0$ nor ${}_mT_b^0$, and (ii) connecting the new
mesh faces to ${}_mT_a^0$ and ${}_mT_b^0$ (Fig. 37).


Figure  37.  Edge  removal 

If the polyhedron originally consisted of $m$ tetrahedra, the
associated polygon has $m$ sides and $m$ apices. The number
of possible retriangulations is

$$N_m = \sum_{i=3}^{m} N_{i-1}\, N_{m+2-i}$$

with $N_2 = 1$ [27, 17]. Clearly, if the polyhedron is not
convex, some possible retriangulations have to be discarded.
The number of different triangles when considering
all retriangulations is $NT_m = m(m-1)(m-2)/6$
[17]. Table 9 shows computed values of $N_m$ and $NT_m$
for $m = 3$ to 9 [17].


m       3    4    5     6     7     8      9
N_m     1    2    5     14    42    132    429
NT_m    1    4    10    20    35    56     84

Table  9  Number  of  possible  retriangulations  and  different 
triangles  as  m  (number  of  connected  tetrahedra)  increases 
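
The entries of Table 9 can be reproduced from the recurrence and the closed-form triangle count. The sketch below takes $N_2 = 1$ as the seed of the recurrence, which is the value that reproduces the tabulated numbers; the function names are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def num_retriangulations(m: int) -> int:
    """N_m: number of triangulations of a polygon with m sides,
    from N_m = sum_{i=3..m} N_{i-1} * N_{m+2-i} with N_2 = 1."""
    if m < 2:
        raise ValueError("m must be at least 2")
    if m == 2:
        return 1
    return sum(num_retriangulations(i - 1) * num_retriangulations(m + 2 - i)
               for i in range(3, m + 1))


def num_distinct_triangles(m: int) -> int:
    """NT_m = m (m - 1) (m - 2) / 6: triangles on m polygon vertices."""
    return m * (m - 1) * (m - 2) // 6


if __name__ == "__main__":
    for m in range(3, 10):
        print(m, num_retriangulations(m), num_distinct_triangles(m))
```

The values of N_m are the Catalan numbers, which explains the rapid growth with m.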


Since $N_m$ grows rapidly, Briere de l'Isle and George have
chosen $m = 9$ as an upper limit for their edge removal
scheme. It should be noted that those retriangulations
represent all possible triangulations of the polyhedron that
do not have ${}_mT_i^1$ but are only a subset of all possible
retriangulations of the polyhedron.

Given a mesh edge ${}_mT_i^1$ connected to more than one
mesh region, edge removal consists of retriangulating the
polyhedron $pol({}_mT_i^1)$ of all mesh regions connected to
${}_mT_i^1$ in such a way that ${}_mT_i^1$ is not present in the
retriangulation (Fig. 37). When edge removal is topologically
possible, the mesh edge is said to be topologically
removable. An edge removal is positive (negative) if the
retriangulation of the polyhedron is better (worse,
respectively) than the original triangulation; in other words, the
variation of the local triangulation quality is positive
(negative, respectively). A brief description of the algorithm
to remove an edge in the context of Geometric triangulation
optimization follows [17]:

1.  Determine quality $Q_{org}$ of triangulation of
$pol({}_mT_i^1)$

2.  Get the associated polygon as an ordered list of
vertices

3.  Consider among all possible retriangulations of
$pol({}_mT_i^1)$ those that are better than the original
one and keep track of the highest quality ($\max(Q_{new})$)

4.  If $\max(Q_{new})$ exists:

a.  Delete all mesh regions connected to ${}_mT_i^1$ to
form a polyhedral cavity

b.  Retriangulate such that the new quality is
$\max(Q_{new})$
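
The steps above can be sketched as an exhaustive search over the triangulations of the associated polygon. This is an illustration only, not the implementation of [17]: vertices are represented by integer labels, `tet_quality` is a hypothetical user-supplied callback returning the quality of a tetrahedron given as a 4-tuple of vertex labels (assumed to return a low value for an invalid tetrahedron), and orientation handling is left to that callback.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Triangle = Tuple[int, int, int]
Tet = Tuple[int, int, int, int]


def polygon_triangulations(ring: Sequence[int]) -> List[List[Triangle]]:
    """All triangulations of the polygon whose ordered vertices are `ring`."""
    n = len(ring)
    if n < 3:
        return [[]]
    results = []
    # The side (ring[0], ring[-1]) belongs to exactly one triangle;
    # enumerate its possible apex ring[k] and recurse on both sides.
    for k in range(1, n - 1):
        for left in polygon_triangulations(ring[:k + 1]):
            for right in polygon_triangulations(ring[k:]):
                results.append(left + [(ring[0], ring[k], ring[-1])] + right)
    return results


def best_edge_removal(ring: Sequence[int], a: int, b: int,
                      tet_quality: Callable[[Tet], float],
                      q_org: float) -> Tuple[Optional[float], Optional[List[Tet]]]:
    """Return (max Q_new, tetrahedra) over retriangulations better than q_org,
    or (None, None) when no retriangulation improves on the original."""
    best_q, best_tets = None, None
    for triangles in polygon_triangulations(ring):
        # Each polygon triangle is connected to both end vertices of the edge.
        tets = [(i, j, k, end) for (i, j, k) in triangles for end in (a, b)]
        q_new = min(tet_quality(t) for t in tets)
        if q_new > q_org and (best_q is None or q_new > best_q):
            best_q, best_tets = q_new, tets
    return best_q, best_tets
```

A production version would typically evaluate each distinct polygon triangle only once, since the same triangle appears in many retriangulations.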

The  initial  triangulation  of  the  polyhedron  for  a  mesh 
edge  classified  in  model  region  is  such  that  there  are: 

1.  m  mesh  regions, 

2.  m  interior  (with  respect  to  the  polyhedron)  mesh 
faces,  and 

3.  1 interior mesh edge connected to m mesh regions.

The resulting triangulation is such that there are:

1.  2m-4  mesh  regions, 

2.  m-2  interior  mesh  faces,  and 

3.  m-3  interior  mesh  edges  each  connected  to  4  mesh 
regions. 

A  local  retriangulation  tool  like  edge  removal  is  typically 
used  to  remove  an  undesirable  mesh  region  from  a  tri¬ 
angulation.  Since  a  mesh  region  has  six  edges,  there  are 
six  possibilities  to  remove  the  mesh  region  using  edge 
removals.  It  is  sometimes  of  interest  to  have  more  ways 
to  remove  that  mesh  region.  Beside  edge  collapsing  and 
mesh  entity  splitting,  the  procedure  that  reverses  the  edge 
removal  process  can  be  used  to  attempt  to  remove  the 
mesh  region.  This  new  procedure  is  described  in  the  next 
section  and  is  called  multi-face  removal. 

4.1.3  Multi-Face  Removal 

Multi-face removal is a procedure that reverses edge
removal; in other words, it considers a configuration that
could have resulted from edge removal and recovers the
starting configuration. When applied to a single mesh
face, it is the classic 2-to-3 swap [35, 19].

Given a simply connected set of mesh faces $\{M_i^2\}$ such that any mesh face in the set connects (through a mesh region) to one mesh vertex on one side and to a second mesh vertex on the other side, the polyhedron $\mathrm{pol}(\{M_i^2\})$ is defined by the union of all mesh regions that connect to a mesh face in $\{M_i^2\}$. Multi-face removal retriangulates the polyhedron $\mathrm{pol}(\{M_i^2\})$ such that all mesh faces in $\{M_i^2\}$ are removed. As for the edge removal, a multi-face removal is positive (negative) if the new triangulation is better (worse, respectively) than the original one. Multi-face removal is topologically possible if (i) the deletion of all mesh regions in $\mathrm{pol}(\{M_i^2\})$ does not lead to the deletion of a mesh vertex, and (ii) the set of mesh edges peripheral to $\{M_i^2\}$ constitutes a single loop that does not touch itself. Figure 38 illustrates cases of multi-face removals that are not topologically possible. When a multi-face removal is topologically possible, the set of mesh faces $\{M_i^2\}$ (as well as any mesh face in the set) is said to be topologically removable.

Figure 38. Cases when the topological state of the set of mesh faces $\{M_i^2\}$ prevents multi-face removal (panels: vertex disconnection; two loops; one loop touches itself)

Since  the  goal  of  the  presented  optimization  algorithm  is 
to  get  rid  of  undesirable  mesh  regions,  the  input  to  the 
multi-face  removal  procedure  is  a  mesh  region  and 
a  bounding  mesh  face  from  which  the  simply  con¬ 
nected  set  of  mesh  faces  is  constructed.  The  description 
of  the  algorithm  follows: 

1. Get the vertex opposite the bounding face in the input mesh region (Fig 39.a)

2. Get the region on the other side of the bounding face

3. Get the vertex opposite the bounding face in that second region (Fig 39.b)

4. Gather all pairs of face-connected mesh regions such that one mesh region connects to the first opposite vertex and the other connects to the second. Keep track of the mesh faces in between pairs of mesh regions; they form the set $\{M_i^2\}$. The set of gathered mesh regions defines $\mathrm{pol}(\{M_i^2\})$ (Fig 39.c)

5. If retriangulation would create invalid elements, do not perform removal

6. Compute quality of initial triangulation Qorg

7. Compute quality Qnew of triangulation that would result from connecting all boundary faces of $\mathrm{pol}(\{M_i^2\})$ to one of the two opposite vertices

8. If Qnew ≤ Qorg, do not perform removal

9. Delete the mesh regions in $\mathrm{pol}(\{M_i^2\})$ to form a polyhedral cavity

10. Connect all faces of the polyhedral cavity to the same opposite vertex (Fig 39.d)

The  initial  triangulation  of  the  polyhedron  (Fig  39. c)  is 
such  that  there  are: 

1. m mesh regions (note that m is an even number),

2. 3m/2-2 interior (with respect to the polyhedron) mesh faces, and

3. m/2-1 interior mesh edges each connected to 4 mesh regions.

The  resulting  triangulation  (Fig  39.d)  is  such  that  there 
are: 

1.  m/2+2  mesh  regions, 

2.  m/2+2  interior  mesh  faces,  and 

3.  1  interior  mesh  edge  connected  to  m/2+2  mesh  re¬ 
gions. 


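Figure 39. Multi-face removal in three dimensions (initial configuration: m tets, 3m/2 - 2 interior faces, m/2 - 1 interior edges; resulting configuration: m/2 + 2 tets, m/2 + 2 interior faces, 1 interior edge)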

4.1.4  Triangulation  Optimization  Using 
Local  Retriangulation  Tools 

The  goal  of  the  optimization  algorithm  is  to  improve 
the  quality  of  Geometric  triangulations  with  respect  to 
a  given  criterion  (e.g.,  element  shape).  The  optimization 
procedure  described  here  makes  use  of  the  local  retri¬ 
angulation  tools  described  above,  namely  edge  removal 
and  multi-face  removal.  Other  local  retriangulation  tools 
which  change  the  number  of  mesh  vertices  like  mesh  en¬ 
tity  splitting,  edge  collapsing,  and  even  local  remeshing 
are  not  incorporated  into  this  specific  optimization  proce¬ 
dure.  Also,  smoothing  techniques  (vertex  repositioning) 
[23,  14]  are  not  addressed.  In  this  discussion,  triangula¬ 
tion  optimization  can  be  used  over  the  whole  triangulation 
or  locally  over  a  sub-triangulation  resulting  from  adaptive 
enrichments  such  as  refinement  and  derefinement. 

The optimization procedure is region based, that is, it
looks for mesh regions that are not acceptable (quality
below qi) and attempts to remove them from the triangulation
with local retriangulation tools. Given a non acceptable
mesh region, one can potentially remove that mesh region
from the triangulation by considering edge removal with
respect to any of its six bounding edges or multi-face
removal with respect to any of its four bounding faces.
The optimization algorithm is described as follows:

1. Initialize queue Qu of non acceptable mesh regions (quality below qi)

2. If (Qu empty) or (there is no edge removal or multi-face removal that can successfully be applied to any mesh region in Qu), end

3.  Pop  a  region  from  Qu 

4.  Consider  which  edge  removal  (with  respect  to  any 
bounding  edge)  or  multi-face  removal  (with  respect 
to  any  bounding  face)  gives  the  best  quality  improve¬ 
ment  of  the  corresponding  polyhedron 


5. If either an edge removal or a multi-face removal has been performed:

a. Remove from Qu any deleted region

b. Enqueue any new non acceptable mesh region

Else (re)enqueue mesh region

6.  Goto  step  2 

The process terminates when either the queue is empty
or no local retriangulation can be applied to any mesh
region in the queue. It terminates in a finite number of
steps, which follows from the criterion for locally
retriangulating: a domain is locally retriangulated only
if the quality of the new triangulation of the domain is
strictly greater than that of the original one. If the
process did not terminate, the quality of the global
triangulation would improve indefinitely, which cannot be
since the vertex set is fixed and therefore admits only a
finite number of triangulations. If the queue is empty
when the procedure terminates, the resulting geometric
triangulation is acceptable and the goal of the
optimization procedure has been met. If the queue is not
empty when the procedure terminates, neither local
transformation procedure (edge removal or multi-face
removal) could be applied to any of the mesh regions in
the queue.
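The control flow of steps 1 to 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `try_best_tool` is a hypothetical callback standing in for steps 4 and 5 (it returns `None` when neither tool can be applied, otherwise the set of deleted regions and the list of new non acceptable regions), and the condition "no tool can be applied to any region in Qu" is detected as one full pass over the queue without success.

```python
from collections import deque

def optimize(bad_regions, try_best_tool):
    """Region-based optimization loop; returns the regions left in the queue."""
    queue = deque(bad_regions)
    misses = 0                      # consecutive regions for which no tool applied
    while queue and misses < len(queue):
        region = queue.popleft()
        outcome = try_best_tool(region)
        if outcome is None:         # nothing applicable: re-enqueue the region
            queue.append(region)
            misses += 1
            continue
        deleted, new_bad = outcome
        queue = deque(r for r in queue if r not in deleted)
        for r in new_bad:           # enqueue new non acceptable regions once
            if r not in queue:
                queue.append(r)
        misses = 0
    return list(queue)

def toy_tool(region):
    """Toy stand-in: even ids can be removed, odd ids cannot (purely illustrative)."""
    if region % 2:
        return None
    half = region // 2
    return {region}, ([half] if half % 2 == 0 else [])

remaining = optimize([8, 3, 12], toy_tool)   # region 3 can never be removed
```

An empty return value corresponds to an acceptable triangulation; a non-empty one lists the regions that neither tool could remove.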

The optimization of a triangulation using local
retriangulation techniques leads to a local optimum.
Depending on the order in which the local transformation
procedures are applied, different local optima can be
reached. Also, to move towards the global optimum, local
retriangulation procedures may have to be applied even
if they are negative. It is impossible to say whether those
local optima are far from or close to the global optimum.
It is conjectured that a global optimum cannot be reached
with local retriangulation techniques. However, in
practice, these local retriangulation techniques often
improve the quality of a triangulation.

4.2.  Refinement 

Refinement algorithms have been decomposed into three
groups, depending on the technique on which they are based:
i) subdivision patterns [2, 6, 57, 48, 9, 38], ii) bisection
(generalized  [59,  60,  45,  44]  and  alternate  [3]),  and  iii) 
insertion  in  a  Delaunay  context  [87]  or  by  mesh  entity 
splitting  [53,  30,  46].  The  following  sections  describe 
these  known  schemes  and  introduce  a  new  procedure 
which  considers  a  full  set  of  subdivision  patterns,  there¬ 
fore  allowing  the  possibility  of  no  over-refinement.  A 
set  of  definitions  is  given  prior  to  the  description  of  the 
refinement  algorithms: 

Conformity: An n-dimensional triangulation is conforming if the intersection of any two non disjoint elements is a common d-dimensional geometric entity with $0 \le d < n$. It is assumed here that conformity is a requirement. Figure 40 illustrates the definition with a two-dimensional example.

Triangulation refinement sequence: The ordered set $\{T_1, \ldots, T_N\}$ is a triangulation refinement sequence if, for all $i \in [1, N-1]$, $T_{i+1}$ is obtained by selectively refining $T_i$.

Nesting: A triangulation $T_j$ is nested into a triangulation $T_i$ if any element of $T_j$ is fully inside one element of $T_i$.

Refinement stability: A refinement scheme is stable if all interior angles of all triangulations in the sequence $\{T_1, \ldots, T_N\}$ are bounded from below and above as N goes to infinity.


Non-conforming 

Figure  40.  Non-conforming  and  conforming 
triangulations  in  two  dimensions 

4.2.1  Subdivision  Patterns 

In the two-dimensional case, two subdivision patterns are
commonly used: i) regular 1:4 (each child triangle is
similar to the parent) and ii) “green” 1:2 (Fig. 41). Bank and
Sherman [2] use a 1:4 subdivision scheme to refine
elements. Any element with two or three non-conforming
vertices is 1:4 subdivided (iteratively). At this stage, no
element has more than one non-conforming vertex. A
clean-up phase which “green” subdivides any remaining
non-conforming element completes the process.
For the next refinement iteration, if an element resulting
from a “green” subdivision is marked for refinement, the
parent element is reinstated and 1:4 subdivided (Fig. 42).
This ensures an angle is not divided more than once.

Figure 41. Classic element subdivision patterns in two dimensions (regular 1:4 and “green” 1:2)

In three dimensions, given a mesh region, subdivision
patterns are applied depending on the number of marked
edges. The set of available subdivision patterns varies.
Biswas and Strawn [6], Rausch et al. [57], and Löhner
and Baum [48] have the 1:2, 1:4, and 1:8 subdivision
schemes (Fig. 43). Bornemann et al. [9] have the 1:2
(“green I”), “green II”, 1:4 (“green III”), and 1:8
subdivision schemes. The “green II” scheme corresponds to
the case where there are 2 non-conforming edges for the
element. Kallinderis and Vijayan [38] use the 1:2, 1:4,
1:8, and a centroidal node subdivision schemes. In the
centroidal node subdivision scheme, a vertex is created
at the centroid of the element and the element is split
accordingly. If a marking pattern does not correspond to
a predefined configuration, it is upgraded to the closest
one. The process terminates in a finite number of steps.
It has been shown that the regular 1:8 subdivision scheme
is stable as long as the proper (shortest) inner diagonal
is chosen [29, 5]. Algorithms based on subdivision
patterns are stable if irregular child elements (not resulting
from regular subdivision) are never further subdivided;
in other words, parents of those are reinstated and
subdivided with the 1:8 subdivision scheme prior to any further
subdivision [57, 48, 9]. Note that the second triangulation
of a refinement sequence is always nested into the first,
but triangulation i+1 may not be nested into triangulation
i (i ≥ 2) due to the possible reinstatement of parents. Any
refinement scheme based on subdivision patterns which
does not have all possible subdivision patterns and/or
reinstates some parent elements prior to further subdivision
will in general over-refine, that is, produce more
refinement than requested by the adaptive procedure. Also,
using subdivision patterns which add a centroidal vertex
when not actually needed will over-refine as well.

Figure 42. Reinstatement of parent element followed by regular subdivision in two dimensions

Figure 43. Classic element subdivision patterns in three dimensions

4.2.2  Generalized  Bisection 

In two dimensions, an element is refined by bisecting
its longest edge (two-triangle algorithm) [59]. Elements
with non-conforming edges are subdivided following the
patterns of Figure 44. The process terminates in a finite
number of steps. Following the results of Rosenberg and
Stenger [61] and Stynes [82] about longest edge bisection,
the scheme is stable; furthermore, interior angles are
never smaller than one half of the smallest angle in the
initial triangulation [59].
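The angle bound can be observed numerically. The Python sketch below is an illustration under simplifying assumptions, not the algorithm of [59]: it repeatedly bisects every triangle along its longest edge and ignores the conforming step entirely, so it only exercises the longest edge bisection result of [61]. The function names are chosen here for illustration.

```python
import math

def angles(tri):
    """Interior angles (radians) of a triangle given as three (x, y) points."""
    out = []
    for i in range(3):
        p, q, r = tri[i], tri[(i + 1) % 3], tri[(i + 2) % 3]
        v1 = (q[0] - p[0], q[1] - p[1])
        v2 = (r[0] - p[0], r[1] - p[1])
        cosang = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        out.append(math.acos(max(-1.0, min(1.0, cosang))))
    return out

def bisect_longest_edge(tri):
    """Split a triangle by joining the midpoint of its longest edge to the opposite vertex."""
    a, b, c = max(
        ((tri[i], tri[(i + 1) % 3], tri[(i + 2) % 3]) for i in range(3)),
        key=lambda t: math.dist(t[0], t[1]),
    )
    mid = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    return [(a, mid, c), (mid, b, c)]

def refine(tri, levels):
    """Apply longest edge bisection to every triangle, `levels` times."""
    tris = [tri]
    for _ in range(levels):
        tris = [child for t in tris for child in bisect_longest_edge(t)]
    return tris
```

Comparing `min(angles(t))` over all triangles returned by `refine` with half of the smallest angle of the starting triangle gives a quick check of the bound for a particular input.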


Figure 44. Non-conforming elements and their triangulations in two dimensions (panels: 1, 2, and 3 non-conforming vertices)

This method of subdivision along the longest edge has
been extended to three dimensions [60]. Elements
to be refined are bisected along their longest edges.
Non-conforming elements are subdivided along their
longest edges in a recursive fashion. Unlike the
two-dimensional case, an element that needs refinement or is
non-conforming must be bisected at its longest edge.

This scheme guarantees nesting. In two dimensions,
following the longest edge bisection results of Rosenberg
and Stenger [61] and Stynes [82], the scheme is stable;
furthermore, interior angles are never smaller than one
half of the smallest angle in the initial triangulation
[59]. In three dimensions, no proof of the stability of
the scheme has been presented so far, probably because
(i) the longest edge in a mesh region is not necessarily
opposite the largest dihedral angle and (ii) the sum of
all dihedral angles of a mesh region is not constant.
However, the scheme seems to be “experimentally” stable.
Because the non-conformity can propagate, this scheme
will in general over-refine.

Joe [45] has proven that the infinite bisection of a
tetrahedron is stable using generalized bisection on a mapped
special tetrahedron. Note that this result does not prove
that generalized bisection in the real space is stable. Liu
and Joe [44] have presented a stable refinement algorithm
that makes use of this result. In the initial triangulation,
for each element, a bisected edge is uniquely chosen (this
does not mean that all elements will be subdivided).
Elements that need to be subdivided are bisected along
their bisected edges. When an element is subdivided into
two elements, the bisected edges for the two new elements
are imposed according to rules given in [44]. Once all
elements that need refinement have been subdivided, there
may be some non-conforming elements in the triangulation.
The process of subdividing elements continues until there
are no more non-conforming elements in the mesh. At this
point, the scheme guarantees nesting, is stable, and will
in general over-refine. After all levels of refinement have
been applied, local transformations [35] are applied to
further improve the quality of the final mesh. It should be
noted that if local transformations are applied after each
refinement iteration, a priori control of stability is lost.
From experimental results given in [44], this scheme
over-refines less than the scheme by Rivara and Levin [60],
especially as the number of refinement levels becomes high.
As a remark, this scheme appears very similar to the
alternate bisection scheme [3] presented in the next section.

4.2.3  Alternate  Bisection 

This approach has been presented by Bänsch [3]. In the
initial triangulation, refinement edges are chosen for each
element (a good choice is the longest edge). Note that
choosing a refinement edge for each element does not mean
that all elements will be subdivided. Elements that need to
be subdivided are bisected along their refinement edges.
When an element is subdivided into two elements, the
refinement edges are topologically imposed on the two
new elements according to Fig. 45. A conforming step
subdivides elements with non-conforming edges.


Figure 45. Alternate bisection in two dimensions

Figure 46. Watson’s algorithm in two dimensions

Figure 47. Mesh entity splitting in three dimensions (panels: region split, 1:4; face split, m:3m with m = 1 or 2 the number of face-connected tets; edge split, m:2m with m the number of edge-connected tets)


This scheme extends to three dimensions. In the initial
triangulation, each face is given a refinement edge (e.g.,
longest edge). For each element in the initial
triangulation, it is then assumed there is at least
one common refinement edge for two adjacent faces,
called a global refinement edge. In particular, this
assumption holds if the longest edge of each face in the
initial triangulation is chosen as the refinement edge.
Each element to be refined is bisected along its global
refinement edge. The refinement edges on the four new faces
resulting from bisection of the two parent faces are
imposed as in the two-dimensional case and the refinement
edge on the interior face is chosen according to rules
given in [3]. The two new elements are then guaranteed to
have a global refinement edge. Elements with non-conforming
edges are bisected until no non-conforming elements remain.
This scheme guarantees nesting and is stable [69, 3].
Because the non-conformity can propagate, this scheme
will in general over-refine.

4.2.4  Delaunay  Insertion 

Inserting  a  new  node  into  a  Delaunay  triangulation  can 
be  done  using,  for  example,  Watson’s  algorithm  [87]. 
All  elements  which  contain  the  new  node  (in  terms  of 
circumcircle  in  two  dimensions  or  circumsphere  in  three) 
are  deleted  to  form  a  point-convex  polyhedral  cavity. 
New  elements  are  created  by  connecting  the  boundary 
of  the  cavity  to  the  new  node  (Fig.  46).  The  new 
triangulation  is  guaranteed  to  be  Delaunay. 
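A two-dimensional version of this insertion step can be sketched in Python as follows. It is a minimal illustration under stated assumptions, not the implementation of [87]: triangles are stored as tuples of three points in counter-clockwise order, the new point is assumed to lie inside the triangulated domain, and degenerate (cocircular) configurations are not treated specially.

```python
def in_circumcircle(tri, p):
    """True if p is strictly inside the circumcircle of the counter-clockwise triangle tri."""
    (ax, ay), (bx, by), (cx, cy) = tri
    adx, ady = ax - p[0], ay - p[1]
    bdx, bdy = bx - p[0], by - p[1]
    cdx, cdy = cx - p[0], cy - p[1]
    det = ((adx * adx + ady * ady) * (bdx * cdy - cdx * bdy)
           - (bdx * bdx + bdy * bdy) * (adx * cdy - cdx * ady)
           + (cdx * cdx + cdy * cdy) * (adx * bdy - bdx * ady))
    return det > 0.0

def insert_point(triangles, p):
    """Insert p into a Delaunay triangulation given as a list of CCW triangles."""
    # Elements whose circumcircle contains the new node are deleted.
    deleted = [t for t in triangles if in_circumcircle(t, p)]
    # Cavity boundary: edges used by exactly one deleted triangle.
    uses = {}
    for t in deleted:
        for i in range(3):
            a, b = t[i], t[(i + 1) % 3]
            uses.setdefault(frozenset((a, b)), []).append((a, b))
    cavity_boundary = [edges[0] for edges in uses.values() if len(edges) == 1]
    kept = [t for t in triangles if t not in deleted]
    # Connect each boundary edge of the cavity to the new node.
    return kept + [(a, b, p) for a, b in cavity_boundary]
```

Keeping the boundary edges with the orientation they had in the deleted triangles makes every new triangle counter-clockwise, so the same `in_circumcircle` test can be reused for the next insertion.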

4.2.5  Splitting 

The insertion of a point into a triangulation can be done
by splitting the mesh entities the new point falls on.
In three dimensions, a mesh region can be split into
four new regions, a mesh face into three new faces,
and a mesh edge into two new edges. Mesh entity
splitting retriangulates a polyhedron by adding a vertex
and connecting the boundary faces of the polyhedron
to the new vertex. In the case of a mesh region, the
polyhedron is the mesh region itself. In the case of a
mesh face or a mesh edge, the polyhedron is built from
the union of all mesh regions connected to the face or
edge, respectively. Figure 47 displays the three types of
split in three dimensions and indicates for each one of
them the change in number of mesh regions.

This  technique  can  be  used  to  add  vertices  into  a  given 
triangulation.  For  instance,  if  the  error  indicator  is  edge- 
based,  any  marked  mesh  edge  is  split.  Mesh  entity  split¬ 
ting  guarantees  nesting,  is  not  stable,  and  will  not  over- 
refine  if  the  mesh  entities  that  are  marked  for  refinement 
are  the  only  ones  to  be  split. 

For  a  given  mesh  region  to  refine,  Golias  and  Tsiboukis 
[30]  split  its  longest  edge,  which  leads  to  the  refinement 
of  all  tetrahedra  connected  to  the  edge.  At  this  point, 
the  scheme  guarantees  nesting,  is  not  stable,  and  does  not 
artificially  refine.  Then,  Delaunay  transformations  and 
node  relaxation  (repositioning)  techniques  are  applied  to 
improve  the  quality  of  the  resulting  triangulation  (nesting 
is  lost).  The  Delaunay  transformations  used  are: 

1. exchange of interface faces (in Fig. 36 upper-left,
when the two faces involved are classified on the same
model face, a 2-to-2 swap, which is a degenerate case of
the 2-to-3 swap, can be applied),

2. 2-to-3 and 3-to-2, and

3. local transformation of tetrahedron (the above three
transformations are applied recursively to the
tetrahedron under consideration, then the tetrahedron’s
neighbors, etc.).

Muthukrishnan et al. [53] sort mesh regions that are
to be refined with respect to increasing length of their
longest edges. The first region to be refined is the one
at the end of the list. Before splitting the longest edge,
the regions connected to the edge are examined. If a
connected region has a longest edge different from the
edge to be split, it is put in the list of regions to be
refined at the appropriate rank. After the split, the list
is updated. This refinement scheme is actually identical
to the scheme described by Rivara and Levin [60] and
therefore has the same properties. It is followed by a
node repositioning procedure (nesting is lost).

Lo  [46]  sorts  (in  an  approximate  way)  the  mesh  edges 
marked  for  refinement  with  respect  to  increasing  length. 
The  mesh  edge  at  the  end  of  the  list  is  split  and  the  list 
is  updated.  This  scheme  is  different  from  the  one  by  Ri¬ 
vara  and  Levin  [60]  and  Muthukrishnan  et  al.  [53]  since 
only  edges  marked  for  refinement  will  be  split.  At  this 
point,  the  scheme  guarantees  nesting,  is  not  stable,  and 
does  not  artificially  refine.  It  is  followed  by  a  triangu¬ 
lation  optimization  procedure  which  makes  use  of  node 
repositioning  and  local  transformations  (nesting  is  lost). 
These  local  transformations  are: 

1.  2-to-3, 

2.  3-to-2,  and 

3.  4-to-4  which  is  an  edge  removal  when  there  are  four 
mesh  regions  connected  to  an  edge. 

4.2.6  Refinement  Using  Full  Set 
of  Subdivision  Patterns 

Refinement  is  performed  by  marking  appropriate  mesh 
edges  for  refinement  and  applying  subdivision  patterns  to 
each  mesh  region.  Each  mesh  region  has  from  zero  to 
six marked edges. Subdivision patterns for each possible
configuration of marked edges have been developed in
order to eliminate any over-refinement. There are ten
possible patterns which are as follows (Fig. 48):

1.  1-edge:  this  is  the  classic  1:2  subdivision  pattern 
(one  template) 

2.  2-edge  (this  is  also  the  Green  II  in  [9]): 

a.  One  face  has  two  marked  edges  (two  templates) 

b.  All  faces  have  one  marked  edge  (one  template) 

3.  3-edge: 

a.  One  face  has  three  marked  edges:  this  is  the 
classic  1:4  subdivision  pattern  (one  template) 

b.  Two  faces  have  two  marked  edges  (four  tem¬ 
plates) 

c.  Three  faces  have  two  marked  edges  (eight  tem¬ 
plates) 

4.  4-edge: 

a.  One  face  has  three  marked  edges  (four  tem¬ 
plates) 

b.  All  faces  have  two  marked  edges  (sixteen  tem¬ 
plates) 


5.  5-edge  (four  templates) 

6.  6-edge:  this  is  the  classic  1:8  subdivision  pattern 
(one  template) 


1-edge  4-edge 


Figure  48.  Subdivision  patterns  in  three  dimensions 


When only the 1:2, 1:4, and 1:8 subdivision patterns are
used, there is no possible triangulation incompatibility at
the face level; in other words, the subdivision patterns
on both sides of a face with either one or three marked
edges will always match (at the face level). Inclusion of
all the refinement types requires explicit consideration of
triangulation compatibility at the face level. If a face with
two and only two marked edges has been triangulated
due to the subdivision of one region using that face, the
template used to subdivide the other region must match
the face triangulation. Since there are a priori two ways
to triangulate a face with two marked edges (Fig. 49),
any pattern which has N faces with two and only two
marked edges needs $2^N$ templates.
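The pattern and template counts listed above can be reproduced by enumeration. The Python sketch below is an illustrative check, not part of the refinement code: it marks every non-empty subset of the six edges of a tetrahedron, identifies a pattern by the number of marked edges together with the sorted number of marked edges on each of the four faces (an identification assumed here for convenience), and assigns two templates per face with exactly two marked edges.

```python
from itertools import combinations

VERTS = range(4)
EDGES = list(combinations(VERTS, 2))   # the six edges of a tetrahedron
FACES = list(combinations(VERTS, 3))   # its four triangular faces

def marked_per_face(marked):
    """Sorted tuple giving, for each face, how many of its three edges are marked."""
    return tuple(sorted(sum(1 for e in marked if set(e) <= set(f)) for f in FACES))

def patterns():
    """Map (number of marked edges, face signature) to the template count 2**N."""
    table = {}
    for k in range(1, 7):
        for marked in combinations(EDGES, k):
            signature = marked_per_face(marked)
            n_two = signature.count(2)   # faces with two and only two marked edges
            table[(k, signature)] = 2 ** n_two
    return table
```

The resulting table has ten entries; for example, the key `(4, (2, 2, 2, 2))` (four marked edges, every face with two of them) maps to sixteen templates.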


Figure  49.  The  two  ways  to  triangulate 
a  mesh  face  with  two  marked  edges 

As is, this refinement scheme is not stable since it is
possible, and likely, that a (solid) angle will be bisected
more than once when multiple refinements are applied in
the same areas. However, it can be made stable at the
price of some over-refinement. Assuming the quality of
the initial triangulation is $Q_1$, stability requires that
the quality $Q_i$ of any subsequent triangulation (i > 1)
stays above the threshold qi, with qi equal to a constant
a times $Q_1$. Given a mesh region with at least one marked
edge but fewer than six, the template corresponding to the
number of marked edges is applied and the optimization
procedure (with qi as the threshold) is applied locally to
the subdivided mesh region. If the optimization procedure
is successful, nothing else has to be done for that mesh
region. However, if the optimization procedure is
unsuccessful, the situation that existed prior to the
application of the template is recovered and the subdivision
pattern is upgraded by marking an additional edge. This
process is repeated until the optimization procedure is
successful or the number of marked mesh edges reaches six.
Recall that the application of the 1:8 template (with shortest
inner diagonal) does not affect the stability of the
refinement scheme (neutral) [29, 5]. Another approach is to
directly upgrade to six marked mesh edges and apply the
1:8 template. It should be noted that as soon as a
subdivision pattern is upgraded, neighboring mesh regions
have to be reprocessed for subdivision.

In  the  context  of  curved  model  boundary,  vertices  re¬ 
sulting  from  refinement  that  are  classified  on  the  model 
boundary  need  to  be  “snapped”  to  the  appropriate  model 
entity.  For  example,  when  studying  the  flow  around  an 
airfoil,  it  is  critical  to  be  able  to  snap  refinement  vertices 
to  the  airfoil  (especially  at  the  leading  edge)  in  order  to 
make  sure  the  resulting  flow  corresponds  to  the  actual 
airfoil  geometry.  Difficulties  may  arise  as  moving  a  ver¬ 
tex  to  its  destination  target  can  generate  invalid  elements 
especially  when  the  triangulation  is  rather  coarse.  Fig¬ 
ure  50  illustrates  this  problem  in  two  dimensions  when 
the  triangulation  around  a  circular  hole  is  selectively  re¬ 
fined.  If  the  snapping  of  a  refinement  vertex  causes  a 
mesh  region  to  be  invalid  or  of  poor  quality,  the  local  re¬ 
triangulation  tools  described  above  can  be  used  to  attempt 
to  remove  that  mesh  region.  This  process  is  repeated  un¬ 
til  the  refinement  vertex  can  be  snapped.  Note  that  other 
local  retriangulation  techniques,  such  as  edge  collapsing, 
mesh  entity  splitting,  and  local  remeshing  can  be  applied 
to these situations. It is possible that local
retriangulation tools may not succeed in snapping all
refinement vertices; however, it is believed that local
remeshing will permit all snappings. Efforts are under way
to complete the appropriate algorithms. A mesh adaptation
procedure should in theory not only be stable (with respect to
triangulation  quality)  but  also  capable  of  snapping  all  re¬ 
finement  vertices  classified  on  the  model  boundary  to  the 
proper  model  entities.  This  new  requirement  lessens  the 
importance  of  stability  and  justifies  the  presented  mesh 
adaptation  procedure  which  makes  use  of  local  retriangu¬ 
lation  tools  to  optimize  the  current  triangulation  and  snap 
refinement  vertices. 


Figure  50.  Snapping  refinement  vertex  to 
the  model  boundary  in  two  dimensions 
can  generate  geometric  invalidity 


4.3. Derefinement

Schemes  that  use  subdivision  patterns  or  bisection  for  re¬ 
finement  can  derefine  by  simply  reversing  the  refinement 
process  [6,  59,  48,  38,  57].  To  illustrate  this  concept,  con¬ 
sider  the  methodology  employed  by  Biswas  and  Strawn 
[6]  which  is  representative  of  such  derefinement  schemes. 
If  two  sibling  edges  (same  parent  edge)  are  marked  for 
derefinement,  they  are  replaced  by  the  parent  edge  and 
all  parent  elements  sharing  the  parent  edge  are  reinstated. 
The  procedure  described  for  refinement  and/or  conformity 
can  then  be  applied  to  the  set  of  elements  that  have  been 
reinstated. Figure 51 shows a simple example of
derefinement. In Figure 51.a, edges marked with a “d” are to
be derefined. Once the parent edges have been reinstated,
any parent edge with at least one child edge not marked
“d” is marked “r” (Fig. 51.b). The refinement procedure
described earlier is then applied to produce the final
derefined triangulation (Fig. 51.c). It should be noted that
Biswas and Strawn [6] perform refinement and derefinement
simultaneously. Derefinement has been separated
only in the scope of the present paper. In order to
efficiently reinstate parent entities, parent elements and
edges are stored, resulting in an overhead estimated at 15% of
total memory requirements in Biswas and Strawn’s case.
It should be noted that any triangulation in the sequence
cannot be coarser than the first one.


Figure  51.  Derefinement  example 


Derefinement  is  performed  here  by  using  a  local  retrian¬ 
gulation  technique  that  deletes  a  vertex:  edge  collapsing. 
A  mesh  edge  is  derefined  by  collapsing  it  to  one  of  its 
end  vertices.  A  description  of  the  algorithm  follows  (see 
also  Fig.  52  for  a  graphical  description): 

1. Check if edge collapsing is topologically possible. If
it is possible, one end vertex is the collapsed vertex
while the other is the target vertex

2. Check if edge collapsing is geometrically possible

3. Delete all mesh regions connected to the collapsed
vertex, which produces a polyhedral cavity

4. Connect the faces of the polyhedral cavity to the
target vertex to form new mesh regions

Since edge collapsing locally modifies a Geometric (valid) triangulation [67, 77], one has to make sure that the validity of the triangulation is not violated by the modification (this check refers to step 1 of the algorithm). Since any mesh entity is classified against the model, it is always possible to predict such violations. Figure 53 contains the pseudo-code to check if a mesh edge can be collapsed to one of its end vertices. It returns TRUE if the mesh edge can be collapsed (FALSE otherwise). Figure 54 illustrates graphically some of the cases, pointed out in the pseudo-code, where edge collapsing is not possible.

Figure 52. Edge collapsing in three dimensions

Before physically collapsing the edge, the geometry of the mesh regions to be created can be predicted exactly (this check refers to step 2 of the algorithm). The volumes of the new mesh regions can be computed by considering all mesh regions which are connected to the collapsed vertex but not connected to the target vertex, and virtually moving the collapsed vertex to the target vertex. Since the computation of the volume of a mesh region always considers the bounding vertices in a certain order, the (virtual) movement of one of its bounding vertices is valid only if the new volume is positive. Therefore, one can always tell beforehand whether the mesh regions to be created are invalid. The quality of the mesh regions to be created can be predicted as well. If that quality is not good enough with respect to some predetermined threshold, the derefinement of the edge need not be performed. This is important in order to guarantee the stability of the refinement/derefinement scheme. Also, assuming both end vertices are candidates to be the target vertex, the target vertex that would create the “better” triangulation of the two is chosen.
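The geometric check of step 2 can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function names, the array layout, and the `min_volume` parameter are assumptions introduced here, and mesh regions are taken to be tetrahedra stored as consistently ordered 4-tuples of vertex indices.

```python
import numpy as np

def signed_volume(a, b, c, d):
    """Signed volume of tetrahedron (a, b, c, d); positive for a valid ordering."""
    return np.dot(np.cross(b - a, c - a), d - a) / 6.0

def collapse_is_geometrically_valid(coords, regions, collapsed, target, min_volume=0.0):
    """Predict whether collapsing `collapsed` onto `target` keeps all regions valid.

    coords  : (n, 3) array of vertex coordinates (hypothetical layout)
    regions : list of 4-tuples of vertex indices, consistently ordered
    """
    for region in regions:
        if collapsed not in region:
            continue  # region is not touched by the collapse
        if target in region:
            continue  # region contains the collapsing edge and disappears
        # Virtually move the collapsed vertex to the target vertex.
        moved = [target if v == collapsed else v for v in region]
        a, b, c, d = (coords[v] for v in moved)
        if signed_volume(a, b, c, d) <= min_volume:
            return False  # the new region would be inverted or degenerate
    return True
```

A quality threshold could be tested in the same loop by replacing the volume test with any shape measure evaluated on the virtually moved region.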

4.4. Complete Mesh Adaptation Procedure

The  actual  implementation  of  the  mesh  adaptation  scheme 
uses  the  following  steps: 

1.  Derefinement  using  edge  collapsing  as  described 
above 

2. Global optimization with qi = Qi

3.  Refinement  using  full  set  of  subdivision  patterns 
without  consideration  for  stability 

4.  Refinement  vertex  snapping  (to  the  model  boundary) 

5.  Global  optimization  with  qi  =  Qi 

So far, problems due to the non-stability of the implemented refinement scheme have not appeared. If they happen, the refinement can be made stable as described above at the price of some over-refinement.

4.5.  Parallelization  of  Mesh  Adaptation 

Today’s  CFD  computations  are  costly  both  in  CPU  time 
and  memory.  For  big  enough  problems,  the  flow  solver 
cannot  be  run  on  a  classic  scalar  workstation  for  which 
performance  and  memory  are  limited.  For  large-scale 
analysis  of  fluid  flows,  it  is  necessary  to  use  a  parallel 
flow  solver.  Since  the  mesh  adaptation  is  an  integral  part 


Get  bounding  vertices  (mT?  C  i-mTi  C  gT^'^) 
of  edge  \Z 

if  dj  =  dj 

J  f  _  T'“2 

g-'i  “  9-'2 

if  dj  =  3  return  TRUE  (ok  to  collapse) 
else  return  FALSE  (cannot  collapse)  (Fig. 
54. a) 

else 

if  dj  =  3  or  d2  =  3  return  TRUE  (target 
vertex  is  the  one  classified  on  lower 
order  model  entity) 

At  this  point,  the  two  mesh  vertices  are 
classified  on  model  boundary 
if  dJ  =  3  return  FALSE 

Switch  (if  necessary)  and  so  that 

dj  >  dj  (from  now  on,  target  vertex  will 
be  if  collapsing  is  possible) 

if  jTji  ^  jTji  return  FALSE  (Fig.  54. b) 

At  this  point,  the  two  vertices  are  classified 
on  model  boundary  and  the  edge  is  classified 
on  the  model  entity  of  higher  order 

for  each  pair  of  mesh  edges  C  i  m'^i  ^ 

)  that  connect  to  mlf  and  respec¬ 

tively  and  connect  to  each  other 

if  d2  =  3  or  dg  =  3  continue 

if  d2  =  dg 

if  gT^^  ^  return  FALSE 

else  if  d^  =  1  return  FALSE 
At  this  point,  the  two  edges  are  classi¬ 
fied  on  same  model  face  or  one  is  clas¬ 
sified  on  model  face  and  the  other  is 
classified  on  the  model  face's  boundary 
Switch  (if  necessary)  m^g  so 

that  d\  >  dg 

.2 

Find  face  ^  g'^i  ^  bounded  by 

^m'^i  'm'^2  I  m^3  ^ 

if  rn'^i  does  not  exist,  return  FALSE 

^2 

if  gTj  ^  9^2^  return  FALSE  (Fig.  54. c) 

^2 

for  each  pair  of  mesh  faces  (m'^i  C  „7\  ^  H 

gT^'^)  that  connect  to  „,T°  and  respec¬ 

tively  and  connect  to  each  other  by  a  mesh 
edge 

if  d|  =  2  and  d^  =  2  return  FALSE  (Fig. 
54. d) 

if  do  not  bound  a  mesh  region, 

return  FALSE 

return  TRUE 

Figure  53.  Pseudo-code  for  checking 
topological  validity  for  edge  collapsing 

of  the  flow  solver,  it  must  be  running  in  parallel  as  well 
in  order  not  to  become  a  bottleneck. 

4.5.1  Derefinement 

If a mesh edge is marked for derefinement, a collapse is attempted. If the polyhedron affected by the collapse is entirely on processor pi, the edge collapse is performed on pi. If the polyhedron is not fully on pi, the missing mesh regions are requested from the appropriate processors. When all processors are done traversing their lists of mesh edges, the processors that have received requests send (migrate) the requested mesh regions. In Figure 55, processor p0 requests mesh regions from processors (p1, p2, p3) and the requested mesh regions are migrated. If there is a conflict, the processor with the lowest pi has priority. On the next iteration, it is the processor with the highest pi that has priority. This switching is done to prevent too much load imbalance at completion. The process of traversing the list of mesh edges and sending/receiving requests continues until a collapse has been attempted for every marked mesh edge. Because mesh regions are migrated, the processors may not be well balanced after the derefinement step. The triangulation is therefore submitted to a load balancing step (at the region level) before going further. Figure 56 shows the speed-ups for a triangulation of approximately 85,000 elements where 50% of the mesh edges are derefined (the resulting triangulation has approximately 46,000 elements).

Figure 54. Some cases when edge collapsing in three dimensions is not possible due to topology

Figure 55. Mesh migration to support parallel distributed derefinement
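The alternating priority rule used to resolve conflicting requests can be sketched as follows. This is only an illustration: the function name and the dictionary layout are hypothetical, and the sketch assumes that the priority keeps alternating between lowest and highest processor number on successive iterations.

```python
def grant_requests(requests, iteration):
    """Decide which processor receives each contested mesh region.

    requests  : dict mapping a region id to the list of requesting processor ranks
    iteration : index of the current request/migration iteration
    Lowest rank wins on even iterations, highest rank on odd ones.
    """
    pick = min if iteration % 2 == 0 else max
    return {region: pick(ranks) for region, ranks in requests.items() if ranks}
```

Alternating the winner keeps one end of the processor range from accumulating all contested regions, which is the load imbalance concern mentioned above.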

4.5.2 Triangulation Optimization

Assuming the current triangulation is partitioned, each processor pi (0 ≤ i < np) optimizes its own partition considering a global quality threshold qi. As processor pi pops a mesh region from its queue Qui, two situations may occur:

1. All polyhedra to be considered for edge removal and multi-face removal are fully on pi (that is, all mesh regions of all polyhedra belong to pi). In that case, the proper local retriangulation tool can be applied (if needed) and any new mesh region of quality below qi is pushed onto Qui

2. At least one polyhedron is not fully on pi. In that case, pi requests, for each polyhedron (concerning edge removal and multi-face removal), any mesh region that is not on pi and pushes the mesh region back onto Qui

The mesh region popping process continues until Qui is empty or stuck (does not change). Clearly, requests concerning mesh regions that have since been deleted are cancelled. After a synchronization step, all processors examine the requests they have received and send (migrate) the appropriate mesh regions to the appropriate processors. If a mesh region is requested by several processors, the processor with the lowest pi has priority and is granted the mesh region. On the next iteration, it is the processor with the highest pi that has priority. This switching is done to prevent too much load imbalance at completion. Each processor pi adds to its queue Qui any new mesh region of quality below qi that it has received and restarts popping mesh regions. The combined process of emptying the queue and migrating requested mesh regions terminates when all queues Qui (0 ≤ i < np) are empty or stuck and there is no mesh region to migrate. Because mesh regions are migrated, the processors may not be well balanced after the optimization step. The triangulation is therefore submitted to a load balancing step (at the region level) before going further. Figure 57 shows the speed-ups for a triangulation of approximately 85,000 elements.

Figure 56. Speed-ups for derefinement (85,000 elements, 50% edges derefined)
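The per-processor popping loop, including the "empty or stuck" termination test, can be sketched as follows. Everything here is an assumption made for illustration: the callbacks `needed_regions` and `improve` are hypothetical stand-ins for the polyhedron construction and the local retriangulation tools, and "stuck" is interpreted as every region left in the queue having been pushed back in turn.

```python
from collections import deque

def drain_queue(queue, owner, rank, needed_regions, improve):
    """Process low-quality regions until the queue is empty or stuck.

    queue          : deque of region ids whose quality is below the threshold
    owner          : dict mapping region id -> rank currently holding it
    needed_regions : function(region id) -> region ids forming its polyhedra
    improve        : function(region id) -> new low-quality region ids
    Returns the set of (region id, owning rank) requests to send.
    """
    requests = set()
    stalled = 0  # consecutive pops that only produced requests
    while queue and stalled < len(queue):
        region = queue.popleft()
        missing = [r for r in needed_regions(region) if owner[r] != rank]
        if missing:
            requests.update((r, owner[r]) for r in missing)
            queue.append(region)  # retry once the regions have been migrated
            stalled += 1
        else:
            queue.extend(improve(region))
            stalled = 0
    return requests
```

After this loop returns on every processor, the requests would be exchanged and granted, and the loop restarted with the newly received regions.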

4.5.3  Refinement 

Any mesh face on a partition boundary with at least one marked mesh edge is triangulated using two-dimensional subdivision patterns (Fig. 58). Since two sibling mesh faces (physically identical mesh faces on two neighboring processors) have the same orientation, it is guaranteed that the application of these templates will produce physically identical triangulations (in terms of child faces). Once all mesh faces on the partition boundary are subdivided, links for all new mesh entities are updated. Then, each processor can apply the three-dimensional templates on any mesh region with at least one marked edge (as described above) without any communication.

Figure 57. Speed-ups for the optimization procedure (85,000 elements)

Figure 58. Subdivision patterns at the mesh face level
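The idea that both copies of a partition-boundary face must be split identically can be sketched with a deterministic face subdivision. This is an illustrative sketch only: the patterns for one, two, and three marked edges are the usual ones and are assumed here (the actual templates are those of Fig. 58), and the diagonal in the two-edge case is chosen from global vertex ids, which is an assumption introduced for the example rather than the rule stated in the text.

```python
def midpoint(a, b):
    """Label for the refinement vertex on edge (a, b), independent of direction."""
    return ("mid", min(a, b), max(a, b))

def subdivide_face(face, marked):
    """Split a triangular face given by three global vertex ids.

    marked : set of frozenset({a, b}) edges flagged for refinement
    Returns the list of child triangles.
    """
    a, b, c = face
    rotations = [(a, b, c), (b, c, a), (c, a, b)]

    def flags(t):
        return [frozenset((t[k], t[(k + 1) % 3])) in marked for k in range(3)]

    count = sum(flags(face))
    if count == 0:
        return [face]
    if count == 1:
        # Rotate so that the marked edge is (v0, v1).
        v0, v1, v2 = next(t for t in rotations if flags(t)[0])
        m = midpoint(v0, v1)
        return [(v0, m, v2), (m, v1, v2)]
    if count == 2:
        # Rotate so that the unmarked edge is (v2, v0).
        v0, v1, v2 = next(t for t in rotations if not flags(t)[2])
        m01, m12 = midpoint(v0, v1), midpoint(v1, v2)
        children = [(m01, v1, m12)]
        if v0 < v2:  # tie-break so that both sides pick the same diagonal
            children += [(v0, m01, m12), (v0, m12, v2)]
        else:
            children += [(v0, m01, v2), (m01, m12, v2)]
        return children
    m01, m12, m20 = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, m01, m20), (m01, b, m12), (m20, m12, c), (m01, m12, m20)]
```

Because the result depends only on the vertex ids and the marked edges, two processors holding the same face obtain the same child faces without exchanging any messages.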

Once all appropriate mesh regions have been subdivided, the refinement vertices which are classified on the model boundary need to be snapped to the corresponding model entity. Since snapping makes use of the local retriangulation tools, the technique used to parallelize that process is similar to the one used to parallelize the derefinement and optimization steps. All processors iterate on a two step process: (i) (sequential) vertex snapping along with requests for missing mesh regions, and (ii) sending of requests and migration of requested mesh regions, until snapping has been attempted for all refinement vertices. At the end of the refinement step, the processors may not be well balanced for two reasons: (i) refinement is selective, and (ii) mesh regions have been migrated (due to snapping). Therefore, a load balancing step is applied before going further. Figure 59 shows speed-ups for the refinement procedure on 36,000 elements when 20% of the mesh edges are refined (the resulting triangulation has 88,000 elements).

5.  Parallel  Adaptive  Analysis  Procedures 

5.1.  Structure  of  a  Parallel  Adaptive 
Analysis  Procedure 

Although the most computationally intensive operations in an adaptive analysis are of the same type as those of a fixed mesh analysis, an adaptive analysis must use more general structures which effectively account for the evolution of the discretization. The structure of a parallel adaptive analysis procedure follows directly from the procedures used for the parallel control of evolving meshes presented in the previous sections. Figure 60 presents an overall flow chart of a parallel automated adaptive analysis procedure.

Figure 59. Speed-ups for parallel refinement

Figure 60. Components of a parallel adaptive analysis procedure

Two main processing phases naturally emerge in the finite element method: the “form phase”, where the local finite element arrays at the sub-domain level are generated, and the “solve phase”, where the global problem is solved. Parallel implementation of the form phase is straightforward, in the sense that it can be performed in parallel with no communication among the processing nodes.

On the other hand, the efficient scalable realization of the solve phase is a non-trivial task. Current Multiple Instruction/Multiple Data (MIMD) computers tie together independent processors using a high speed switch under a message passing paradigm. The resulting system incorporates relatively powerful individual processors with large local memories. The communication bandwidth between processors remains well below the processor-to-memory bandwidth, resulting in a significant cost for interprocessor communication compared to local computation. This type of architecture has a significant impact on the design of parallel algorithms for the solution of large linear systems. Such algorithms must amortize any communication costs over large amounts of simultaneous parallel computation. Additionally, the large local data space is still only a fraction of the global memory space, and data cannot be highly duplicated over multiple processors if full advantage is to be taken of the available memory. Under these constraints, two different algorithm classes become attractive for the solution of linear systems: Krylov space based iterative solvers and domain decomposition techniques.

In the remainder of this subsection, a brief review is given of the Krylov space based GMRES procedure used in the rotorcraft aerodynamics application discussed in subsequent subsections. Readers interested in more information on Krylov space based and domain decomposition methods are referred to the chapters of this report by van der Vorst and Farhat, respectively.

Given the non-symmetric linear system $A \cdot x = b$, the Generalized Minimal Residual (GMRES) algorithm of Saad and Schultz [62] attempts to find the approximate solution $p_0 + z$, $z$ being in the Krylov space $K = \mathrm{span}(r_0, A \cdot r_0, \ldots, A^{k-1} \cdot r_0)$ and $r_0 = b - A \cdot p_0$.

$z$ is the solution of the minimization problem $\min_{z \in K} \| b - A \cdot (p_0 + z) \|$, which is solved by means of the QR algorithm. The GMRES algorithm obtains an orthonormal basis of $K$ by means of a Gram-Schmidt procedure which involves matrix-vector multiplications and dot products. These operations represent the computer intensive part of the algorithm. In general, all Krylov methods can be written in terms of these two basic kernels. It is therefore important to devise efficient ways of performing distributed matrix-vector and dot product operations in parallel.

The matrix-vector multiplications necessitate the exchange of data through the inter-processor boundaries. In order to overlap communication and computation for efficiency reasons, these operations can be realized following a four step procedure on each processing node: (i) send data relative to the inter-processor boundaries to each neighboring processor, (ii) perform computations involving only data relative to nodes that lie within the internal volume of the partition, (iii) receive data relative to the inter-processor boundaries from all the neighbors, (iv) perform computations involving only data relative to nodes lying on the inter-processor boundaries.

For the implementation of the dot product operations, nodes that lie on the inter-processor boundaries are randomly split, so that two partitions that share an internal boundary are each assigned only a subset of the nodes of that internal boundary. Each processing node then performs the dot product involving nodes contained in its internal volume and its subset of nodes on the partition boundaries. A global sum of the local results at the processor level yields the global dot product result.
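The ownership-based dot product can be sketched as follows, with the partitions simulated inside one process. This is a minimal sketch under stated assumptions: the function names and data layout are hypothetical, one scalar unknown per node is assumed, and the final Python `sum` stands in for the global reduction that a message passing library would perform.

```python
import numpy as np

def assign_owners(shared_nodes, ranks_sharing, seed=0):
    """Randomly give each shared boundary node to exactly one rank sharing it."""
    rng = np.random.default_rng(seed)
    return {n: int(rng.choice(ranks_sharing[n])) for n in shared_nodes}

def local_dot(x, y, local_nodes, owner, rank):
    """Partial dot product over interior nodes and owned boundary nodes."""
    total = 0.0
    for n in local_nodes:
        # Interior nodes have no entry in `owner` and always count locally.
        if owner.get(n, rank) == rank:
            total += x[n] * y[n]
    return total

def global_dot(x, y, partitions, owner):
    """Sum of the per-rank partial results (stand-in for a global reduction)."""
    return sum(local_dot(x, y, nodes, owner, rank)
               for rank, nodes in partitions.items())
```

Each shared node contributes exactly once, so no boundary term is double counted regardless of how the random split falls.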

The minimization problem in the GMRES algorithm can be written in terms of an upper Hessenberg matrix, whose entries are essentially the results of the dot products performed during the orthogonalization procedure. At the end of the Gram-Schmidt procedure, each processing node therefore has complete knowledge of the upper Hessenberg matrix and is able to solve the minimization problem independently with no communication. It should be remarked that the size of the Hessenberg matrix is the size of the Krylov space employed, typical values for the applications considered here being around 5-30. The computer intensive SAXPY operations needed in order to update the solution of the linear system are consequently performed in parallel with no communication. Once convergence is achieved in the iterative linear solver, each processing node has complete knowledge of the incremental solution at the current Newton or time step, and it is therefore able to update the current state completely independently, without any inter-processor communication.
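A serial sketch of one GMRES cycle makes the role of the two kernels and of the small Hessenberg problem explicit. It is an illustration under assumptions, not the parallel implementation described here: the function name and arguments are hypothetical, the orthogonalization uses modified Gram-Schmidt, and `numpy.linalg.lstsq` stands in for the QR-based solution of the small least-squares problem.

```python
import numpy as np

def gmres_cycle(matvec, b, x0, k):
    """One GMRES cycle with a Krylov space of dimension at most k."""
    r0 = b - matvec(x0)
    beta = np.linalg.norm(r0)
    if beta == 0.0:
        return x0
    n = b.size
    Q = np.zeros((n, k + 1))   # orthonormal basis of the Krylov space
    H = np.zeros((k + 1, k))   # upper Hessenberg matrix
    Q[:, 0] = r0 / beta
    m = k
    for j in range(k):
        w = matvec(Q[:, j])                  # kernel 1: matrix-vector product
        for i in range(j + 1):
            H[i, j] = np.dot(Q[:, i], w)     # kernel 2: dot product
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-14:              # Krylov space exhausted
            m = j + 1
            break
        Q[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m + 1)
    e1[0] = beta
    # Small (m+1) x m problem: cheap enough to solve redundantly on every node.
    y, *_ = np.linalg.lstsq(H[:m + 1, :m], e1, rcond=None)
    return x0 + Q[:, :m] @ y                 # SAXPY-type update
```

In a distributed setting only `matvec` and `np.dot` would involve communication; the Hessenberg solve and the final update are local, as noted above.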

It should be noted that the GMRES algorithm, like all other Krylov methods, does not need to operate on the system matrix itself, but just needs to compute products of this Jacobian matrix with a given vector. One can take advantage of this feature and develop a matrix-free version of the algorithm [37, 36] in which the matrix-vector products are approximated with a finite difference stencil. This has the advantage of avoiding the storage of the tangent matrix, thus realizing a substantial saving of computer memory at the cost of additional on-processor computations. In the matrix-free version of the algorithm, matrix-vector multiplications of the form $A(f) \cdot u$ are approximated by means of a finite difference of residuals $b$ as

$$A(f) \cdot u \approx \frac{b(f) - b(f + \varepsilon u)}{\varepsilon},$$

where $f$ is the vector of the field variable nodal values and $\varepsilon$ is a perturbation parameter which is computed by minimizing the sum of the truncation error, which results from truncating the Taylor expansion, and the cancellation error, which is a consequence of operating in finite precision arithmetic.

The addition of preconditioners to the solution strategy is a necessary ingredient for the successful application of Krylov space solvers. Preconditioners can, however, complicate the parallelization, leading to increased data communication or the need for a global ordering. Depending upon the underlying problem, local preconditioners may prove adequate to assure convergence in a reasonable number of iterations, or the preconditioner may be calculated one time, stored, and used repeatedly.
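The finite-difference approximation of the matrix-vector product used in the matrix-free variant can be sketched as follows. The function name is hypothetical, and the default choice of the perturbation is a common heuristic assumed here for illustration; the text only states that the perturbation balances truncation and cancellation errors.

```python
import numpy as np

def matrix_free_matvec(residual, f, u, eps=None):
    """Approximate A(f).u by a one-sided difference of residuals b.

    residual : function returning the residual vector b for a given state
    f        : current vector of nodal values
    u        : vector to be multiplied
    """
    if eps is None:
        norm_u = np.linalg.norm(u)
        if norm_u == 0.0:
            return np.zeros_like(u)
        # Heuristic (an assumption): square root of machine precision,
        # scaled by the sizes of the state and of the direction.
        eps = np.sqrt(np.finfo(float).eps) * (1.0 + np.linalg.norm(f)) / norm_u
    return (residual(f) - residual(f + eps * u)) / eps
```

Each product costs one extra residual evaluation, which is the additional on-processor computation traded for not storing the tangent matrix.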

5.2.  Finite  Element  Code  for 
Rotorcraft  Aerodynamics 

This section presents a parallel adaptive procedure for the automated aerodynamic analysis of helicopter rotors based on the procedures discussed in this paper. Adaptive analyses on unstructured discretizations represent an effective and accurate method to address the complex physical phenomena that characterize rotorcraft systems. The problem of the accurate numerical simulation of these phenomena has recently stimulated a vigorous research effort in the scientific community, certainly prompted by the fact that rotor-body interactions, transonic effects, wake effects and blade stall all have a major impact on the performance, stability and noise characteristics of helicopter rotors.

One of the most important characteristics and distinguishing features of the software presented here is that all the different phases of the analysis, namely the mesh partitioning, the finite element solution, the error indication, the mesh adaptation and the subsequent load balancing, are realized without leaving the parallel environment. In contrast with other procedures that perform only part of the analysis in parallel, for example just the finite element solution phase, our approach has the advantage of making better use of the power of a distributed memory architecture, leading to an integrated software environment, reducing the I/O and avoiding the bottlenecks that are always present when one tries to solve certain phases of the analysis in serial, especially when very large problems are addressed.

This integrated approach to the parallel adaptive solution of PDEs has led us to select the message passing paradigm as our method of choice for the parallel programming. This is in contrast with the trend shown by some recent publications [36, 39, 52], where parallel finite element methodologies on fixed meshes have been developed based on data parallel techniques. In fact, we believe that the software development is more easily accomplished in a message passing programming model when one has to deal with adaptive strategies and mesh modification techniques. With the idea of developing a uniform software environment, we have used portable message passing protocols in each stage of the analysis. The implementation has been carried out using the message passing library standard MPI [1] and it has been tested on IBM SP-1 and SP-2 systems.

The procedure developed employs a stabilized finite element formulation which is valid for forward flight and for hovering rotor problems, as well as for general unsteady and steady compressible flow problems. The linear algebra is solved by means of a scalable implementation of the standard and matrix-free GMRES algorithms. Simple techniques are used for estimating regions of high error with the purpose of driving the adaptive procedures. Techniques to effectively handle the far-field and symmetry boundary conditions for a hovering rotor are considered. Results are presented to demonstrate the ability of the parallel adaptive procedures to solve rotorcraft aerodynamics problems.

Consideration is also given to measures of efficiency and scalability of the parallel adaptive procedures that have been developed. The importance of these measures is demonstrated.

5.2.1  Finite  Element  Formulation 

The initial/boundary value problem can be expressed by means of the Euler equations in quasi-linear form as

$$U_{,t} + A_i \cdot U_{,i} = E, \qquad (i = 1, \ldots, n_{sd}) \qquad (17)$$

plus well posed initial and boundary conditions. In equation (17), $n_{sd}$ is the number of space dimensions, while $U = \rho\,(1, u_1, u_2, u_3, e)$ are the conservative variables. $A_i \cdot U_{,i} = F_{i,i}$, where $F_i = \rho u_i\,(1, u_1, u_2, u_3, e) + p\,(0, \delta_{1i}, \delta_{2i}, \delta_{3i}, u_i)$ is the Euler flux, and $E = \rho\,(0, b_1, b_2, b_3, b_i u_i + r)$ is the source vector. In the previous expressions, $\rho$ is the density, $u = (u_1, u_2, u_3)$ is the velocity vector, $e$ is the total energy, $p$ is the pressure, $\delta_{ij}$ is the Kronecker delta, $b = (b_1, b_2, b_3)$ is the body force vector per unit mass and $r$ is the heat supply per unit mass.
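The conservative variables, Euler flux and source vector defined above translate directly into code. The sketch below only evaluates those definitions at a single point; the function names and argument order are hypothetical.

```python
import numpy as np

def conservative_variables(rho, u, e):
    """U = rho (1, u1, u2, u3, e)."""
    return rho * np.array([1.0, u[0], u[1], u[2], e])

def euler_flux(rho, u, e, p, i):
    """F_i = rho u_i (1, u1, u2, u3, e) + p (0, d1i, d2i, d3i, u_i), with i in {0, 1, 2}."""
    delta = np.zeros(3)
    delta[i] = 1.0  # Kronecker delta for direction i
    convective = rho * u[i] * np.array([1.0, u[0], u[1], u[2], e])
    pressure = p * np.array([0.0, delta[0], delta[1], delta[2], u[i]])
    return convective + pressure

def source_vector(rho, u, b, r):
    """E = rho (0, b1, b2, b3, b_i u_i + r)."""
    return rho * np.array([0.0, b[0], b[1], b[2], np.dot(b, u) + r])
```

For a flow aligned with the first axis, the first flux component is the mass flux and the second is the momentum flux plus the pressure.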

The Time-Discontinuous Galerkin Least-Squares (TDG/LS) finite element method is used in this effort [70, 71]. The TDG/LS method is developed starting from the symmetric form of the Euler equations expressed in terms of the entropy variables $V$, and it is based upon the simultaneous discretization of the space-time computational domain. A least-squares operator and a discontinuity capturing term are added to the formulation for improving stability without sacrificing accuracy. The TDG/LS finite element method takes the form

$$
\begin{aligned}
&\int_{Q_n} \left( -W^h_{,t} \cdot U(V^h) - W^h_{,i} \cdot F_i(V^h) + W^h \cdot E(V^h) \right) dQ \\
&+ \int_{\mathcal{D}(t_{n+1})} W^h(t_{n+1}^-) \cdot U\!\left(V^h(t_{n+1}^-)\right) d\mathcal{D}
 - \int_{\mathcal{D}(t_n)} W^h(t_n^+) \cdot U\!\left(V^h(t_n^-)\right) d\mathcal{D} \\
&+ \int_{P_n} W^h \cdot F_i(V^h)\, n_i \, dP \\
&+ \sum_{e=1}^{(n_{el})_n} \int_{Q_n^e} \left(\mathcal{L} W^h\right) \cdot \tau \left(\mathcal{L} V^h\right) dQ \\
&+ \sum_{e=1}^{(n_{el})_n} \int_{Q_n^e} \nu^h \, \nabla_\xi W^h \cdot \mathrm{diag}[A_0] \, \nabla_\xi V^h \, dQ = 0. \qquad (18)
\end{aligned}
$$

Integration is performed over the space-time slab $Q_n$, the evolving spatial domain $\mathcal{D}(t)$ of boundary $\Gamma(t)$ and the surface $P_n$ described by $\Gamma(t)$ as it traverses the time interval $I_n$. $W^h$ and $V^h$ belong to suitable spaces of test and trial functions, while $\tau$ and $\nu^h$ are appropriate stabilization parameters. $A_0 = \partial U / \partial V$ is the metric tensor of the transformation from conservation to entropy variables. Refer to [70, 71] for additional details on the TDG/LS finite element formulation.

Two different three dimensional space-time finite elements have been implemented. The first is based on a constant-in-time interpolation and, having a low order of time accuracy but good stability properties, is well suited for solving steady problems using a local time stepping strategy. The second makes use of linear-in-time basis functions and, exhibiting a higher order temporal accuracy, is well suited for addressing unsteady problems such as forward flight. In these cases, moving boundaries are handled by means of the space-time deformed element technique [84].

For efficiently solving hover problems, a formulation starting from the Euler equations written in a rotating frame is included in the program. This allows treatment of a hovering rotor as a steady problem when the unsteadiness in the wake can be neglected, thus allowing the use of the less computationally expensive constant-in-time formulation.

Assuming that the axis of rotation is coincident with the z axis and that the angular velocity is $\Omega$, the compressible Euler equations in a rotating frame can be expressed in terms of the absolute flow variables $U$ as

$$U_{,t} + (A_i - v_i I) \cdot U_{,i} = E + E_g, \qquad (19)$$

where $v_1 = -\Omega y$, $v_2 = \Omega x$, $v_3 = 0$ and $E_g$ can be defined as

$$
E_g = C\,U =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 0 & \Omega & 0 & 0 \\
0 & -\Omega & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix} U,
\qquad (20)
$$


or,  in  terms  of  entropy  variables.  Eg  =  CV,  C  = 
—pT  C.  Clearly,  by  the  nature  of  the  gyroscopic  terms, 
we  have  that  =  —  C. 
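The gyroscopic source term and the frame velocity can be checked numerically with a short sketch. It reads the matrix of equation (20) as a 5 x 5 array whose only non-zero entries couple the two in-plane momentum components, which is an interpretation of the printed matrix; the function names are hypothetical.

```python
import numpy as np

def gyroscopic_matrix(omega):
    """Matrix C such that E_g = C U for rotation about the z axis."""
    C = np.zeros((5, 5))
    C[1, 2] = omega    # x-momentum source:  omega * (rho u2)
    C[2, 1] = -omega   # y-momentum source: -omega * (rho u1)
    return C

def frame_velocity(omega, x, y):
    """Velocity (v1, v2, v3) = (-omega y, omega x, 0) entering A_i - v_i I."""
    return np.array([-omega * y, omega * x, 0.0])
```

With these entries the momentum part of C U equals minus the density times the cross product of the angular velocity with the absolute velocity, and the matrix is skew-symmetric.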

We remark that the rotating frame formulation of the compressible Euler equations in terms of absolute flow variables is formally equivalent to a change of variables (modification of the Jacobians $A_i$ into $A_i - v_i I$) plus the introduction of a source term $E_g$.
From the formulation expressed in equation (19), a TDG/LS finite element formulation can be easily constructed along the lines of equation (18). In an inertial frame, a definition of $\tau$ that results in full upwinding on each mode of the system [70] is given by

r  =  A^^(A|’diag(A^^)AfA^^)  (21) 


where 


and  are  the  local  element  coordinates,  xq  and  re¬ 
ferring  to  the  time  dimension.  In  a  rotating  frame,  we 
redefine  A^  as 

=  . 

Solution  to  (21)  can  be  obtained  based  upon  the  eigen- 
problem 

(A[diag(Ao-i)A^-A2Ao-^)  ■Ti  =  0.  (22) 

The  eigenproblem  is  simplified  by  means  of  a  similarity 
transformation  S  that  diagonalizes  Ai  and  and  sym¬ 
metrizes  A3  [86].  However,  the  term  arising  from  Eg 
remains  non-symmetric.  We  have  implemented  both  the 
non-symmetric  and  a  symmetric  form  obtained  by  drop¬ 
ping  the  contribution  of  Eg  from  (22)  and  have  found 
that  for  the  hovering  rotors  that  we  have  studied  in  our 
numerical  simulations,  the  symmetric  form  gives  results 
indistinguishable  from  those  of  the  non-symmetric  form 
at  a  lower  computational  cost. 

Discretization of the weak form implied by the TDG/LS method leads to a non-linear discrete problem, which is solved iteratively using a quasi-Newton approach. At each Newton iteration, a non-symmetric linear system of equations is solved using the GMRES algorithm. We have developed scalable parallel implementations of the preconditioned GMRES algorithm and of its matrix-free version [37, 36]. This latter algorithm approximates the matrix-vector products with a finite difference stencil, with the advantage of avoiding the storage of the tangent matrix, thus realizing a substantial saving of computer memory at the cost of additional on-processor computations. Preconditioning is achieved by means of a nodal block-diagonal scaling transformation.

In this work we have implemented a simple error indicator based on the norm of the gradient of the flow variables and a slightly more sophisticated one [47] for linear elements which takes the basic form

$$e_i = \frac{h^2 \, | \text{Second Derivative of } \Phi |}{h \, | \text{First Derivative of } \Phi | + \epsilon \, | \text{Mean Value of } \Phi |},$$

where $e_i$ is the error indicated at node $i$, $h$ is a mesh size parameter, $\Phi$ is the solution variable being monitored, and $\epsilon$ is a tuning parameter. The second derivative of $\Phi$ is computed using a variational recovery technique.

The  edge  values  of  the  error  indicator  are  computed  by 
averaging  the  corresponding  two  nodal  values.  These 
edgewise  error  indicator  values  are  then  used  for  driving 
the  mesh  adaptation  procedure.  Appropriate  thresholds 
are  supplied  for  the  error  values,  so  that  the  edge  is  refined 
if  the  error  is  higher  than  the  maximum  threshold,  while 
the  edge  is  collapsed  if  the  error  is  less  than  the  minimum 
threshold. 
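The way nodal error values drive the adaptation can be sketched as follows. This is an illustration under assumptions: the function names, the string labels, and the scalar form of the indicator (with the second-derivative term scaled by the square of the mesh size so that the ratio is dimensionless) are introduced here and are not taken from the code described in the text.

```python
def error_indicator(h, second, first, mean, eps):
    """Nodal indicator: h^2 |second| / (h |first| + eps |mean|)."""
    return h ** 2 * abs(second) / (h * abs(first) + eps * abs(mean))

def mark_edges(edges, nodal_error, refine_above, collapse_below):
    """Average the two nodal errors of each edge and compare with thresholds.

    edges       : iterable of (a, b) node index pairs
    nodal_error : sequence of nodal error indicator values
    Returns a dict mapping each edge to "refine", "collapse" or "keep".
    """
    marks = {}
    for a, b in edges:
        err = 0.5 * (nodal_error[a] + nodal_error[b])
        if err > refine_above:
            marks[(a, b)] = "refine"
        elif err < collapse_below:
            marks[(a, b)] = "collapse"
        else:
            marks[(a, b)] = "keep"
    return marks
```

Edges marked "refine" would feed the refinement step and edges marked "collapse" the derefinement step of the adaptation procedure.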
5.2.2 Boundary Conditions for Hovering Rotors

The imposition of the correct far-field boundary conditions
is a critical issue in the analysis of hovering rotors,
when one wants to give an accurate representation of the
hovering conditions within a finite computational domain.
To determine the inflow/outflow far-field conditions
we have adopted the methodology suggested by Srinivasan
et al. [81], where the 1-D helicopter momentum
theory is used to determine the outflow velocity due
to the rotor wake system. The inflow velocities at the
remaining portion of the far-field are determined by
considering the rotor as a point sink of mass, so as to achieve
conservation of mass and momentum within the computational
domain.

Another important condition that must be considered for
the efficient simulation of hovering rotors is the periodicity
of the flow field. This allows consideration of a
reduced computational domain given by the angle of
periodicity $\psi = 2\pi/n_b$, $n_b$ being the number of rotor blades.
The introduction of the periodicity conditions in the
rotating wing flow solver has been implemented treating
them as linear 2-point constraints applied via transformation
as part of the assembly process. This approach has
the double advantage of being easily parallelizable and of
avoiding the introduction of Lagrange multipliers. On the
other hand, it requires the mesh discretizations on the two
symmetric faces of the computational domain to match on
a vertex by vertex basis. Since this is not directly obtainable
with the currently used unstructured mesh generator,
a mesh matching technique has been developed for
appropriately modifying an existing discretization.

In  order  to  simplify  the  discussion,  define  one  of  the 
symmetric  model  faces  as  “master”  and  the  other  as 
“slave”.  The  face  discretization  of  the  slave  model  face 
is  deleted  from  the  mesh,  together  with  all  the  mesh 
entities  connected  to  it.  The  mesh  discretization  of  the 
master model face is then rotated by the symmetry angle $\psi$
about the axis of rotation and copied onto the slave model
face, yielding the required matching face discretizations.
The matching procedure is then completed by filling the gap
between  the  new  discretized  slave  face  and  the  rest  of 
the  mesh  using  a  face  removal  technique  followed  by 
smoothing  and  mesh  optimization. 
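
The rotate-and-copy step of the matching procedure can be illustrated as follows. The sketch assumes a rotation axis along z through the origin, a two-bladed rotor and two made-up master vertices; the function name `rotation_matrix` is hypothetical and the gap-filling, smoothing and optimization stages are not covered.

```python
import numpy as np


def rotation_matrix(axis, angle):
    """Rotation tensor about a unit axis (Rodrigues formula)."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    k = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * k + (1.0 - np.cos(angle)) * (k @ k)


n_blades = 2                       # assumed number of rotor blades
psi = 2.0 * np.pi / n_blades       # angle of periodicity
R = rotation_matrix([0.0, 0.0, 1.0], psi)

# Made-up vertex coordinates on the master face, one vertex per row.
master_nodes = np.array([[1.0, 0.0, 0.2],
                         [2.0, 0.0, -0.1]])

# Each slave vertex is the image of its master vertex under the rotation.
slave_nodes = master_nodes @ R.T
print(slave_nodes)
```

Because the slave vertices are generated from the master ones, the vertex-by-vertex match required by the 2-point constraints holds by construction.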

The imposition of the constraints can be formalized in
the following manner. Consider the partition of the unknowns
$V$ into internal ($V_i$), master ($V_m$) and slave ($V_s$)
unknowns, as

$$V = (V_i, V_m, V_s).$$

The slave unknowns can be expressed symbolically
as functions of the master unknowns as

$$V_s = G \cdot V_m,$$

or, for the $j$-th master-slave pair of nodes, as

$$V_s^j = G^j \cdot V_m^j,$$

where

$$G^j = \begin{bmatrix} 1 & 0 & 0 \\ 0 & R & 0 \\ 0 & 0 & 1 \end{bmatrix}, \tag{23}$$

$R$ being the rotation tensor associated with the rotation
by the symmetry angle $\psi$ about the axis of rotation.

The minimal set of unknowns $\hat{V} = (V_i, V_m)$ is related
to the redundant set $V$ by

$$V = \Gamma \cdot \hat{V} = \begin{bmatrix} I & 0 \\ 0 & I \\ 0 & G \end{bmatrix} \cdot \hat{V}. \tag{24}$$

The unconstrained linearized discrete equations of motion
read

$$J \cdot \Delta V = r,$$

where $J$ is the tangent matrix and $r$ is the residual vector.
Applying the transformation $\Gamma$ to the unconstrained
system yields the constrained reduced system

$$\Gamma^T J \Gamma \cdot \Delta \hat{V} = \Gamma^T \cdot r. \tag{25}$$


Refer  to  [74]  for  implementation  details  of  this  technique. 
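
A small numerical sketch may help to see how the transformation eliminates the slave unknowns. It is a simplification made for illustration only: a single master-slave pair carrying a two-component vector unknown, a 2-by-2 rotation in place of the full block of Eq. (23), and a random symmetric positive definite matrix standing in for the tangent matrix. None of these choices comes from the reference implementation [74].

```python
import numpy as np

rng = np.random.default_rng(0)

n_i, n_m = 4, 2            # internal and master unknowns; slaves also number n_m
psi = np.pi                # periodicity angle of a two-bladed rotor
G = np.array([[np.cos(psi), -np.sin(psi)],
              [np.sin(psi), np.cos(psi)]])

# Transformation from the minimal set (V_i, V_m) to the redundant
# set (V_i, V_m, V_s), with V_s = G V_m.
gamma = np.block([
    [np.eye(n_i), np.zeros((n_i, n_m))],
    [np.zeros((n_m, n_i)), np.eye(n_m)],
    [np.zeros((n_m, n_i)), G],
])

n = n_i + 2 * n_m
A = rng.standard_normal((n, n))
J = A @ A.T + n * np.eye(n)      # stand-in tangent matrix (assumption)
r = rng.standard_normal(n)       # stand-in residual vector (assumption)

# Constrained reduced system, solved for the minimal set of unknowns.
J_red = gamma.T @ J @ gamma
r_red = gamma.T @ r
dV_min = np.linalg.solve(J_red, r_red)

# Recover the redundant set; the slave part satisfies the constraint exactly.
dV = gamma @ dV_min
print(dV[n_i + n_m:], G @ dV[n_i:n_i + n_m])
```

No Lagrange multipliers appear, and the reduced system has as many unknowns as the minimal set.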

5.2.3  Subsonic  and  Transonic  Hovering  Rotors

Caradonna  and  Tung  [12]  have  experimentally  investi¬ 
gated  a  model  helicopter  rotor  in  several  subsonic  and 
transonic  hovering  conditions.  These  experimental  tests 
have  been  extensively  used  for  validating  CFD  codes  for 
rotating  wing  analysis.  The  experimental  setup  was  com¬ 
posed  of  a  two-bladed  rotor  mounted  on  a  tall  column 
containing  the  drive  shaft.  The  blades  had  rectangular 
planform,  square  tips  and  no  twist  or  taper,  made  use  of 
NACA0012  airfoil  sections  and  had  an  aspect  ratio  equal 
to  six. 

Figure 61 shows the experimental and numerical values
of the pressure coefficients at different span locations for
three subsonic test cases investigated by Caradonna and
Tung, namely $\theta_c = 0^\circ$ and $M_t = 0.520$, $\theta_c = 5^\circ$ and
$M_t = 0.434$, $\theta_c = 8^\circ$ and $M_t = 0.439$. The agreement
with the experimental data is good at all locations,
including the section close to the tip. Only two pressure
distributions are presented for each case because of space
limitations; however, similar correlation with the experimental
data was observed at all the available locations. Relatively
crude meshes have been employed for all three
test cases, with the coarsest mesh of only 101,000 tetrahedra
being used for the $\theta_c = 0^\circ$ case, and the finest of
152,867 tetrahedra for the $\theta_c = 8^\circ$ test problem.

The analysis was performed on 32 processing nodes of an
IBM SP-2. Reduced integration was used for the interior
elements to lower the computational cost, while full
integration was used at the boundary elements for better
resolution of the airloads, especially at the trailing edge
of the blade. The GMRES algorithm with block-diagonal
preconditioning was employed, yielding an average number
of GMRES iterations to convergence of about 10. The
analysis was advanced in time using a single Newton
iteration per time step and a local time stepping strategy
with CFL numbers ranging from 10 at the
beginning of the simulation to 20 towards convergence,
yielding a reduction in the energy norm of the residual of
almost four orders of magnitude in 50 to 60 time steps.
The symmetric form of the least-squares stabilization was
employed, and the discontinuity capturing operator was
not activated.

Figure 61. Computed and experimental pressure
coefficients on the blade at different span locations, for
the three subsonic cases $\theta_c = 0^\circ$, $M_t = 0.520$;
$\theta_c = 5^\circ$, $M_t = 0.434$; $\theta_c = 8^\circ$, $M_t = 0.439$.

Figure  62  shows  the  experimental  and  numerical  values 
of  the  pressure  coefficients  for  a  transonic  case  denoted  by 
$\theta_c = 8^\circ$ and $M_t = 0.877$. The first two plots of Figure 62
present  the  pressure  distributions  obtained  using  an  initial 
crude  grid  consisting  of  142,193  tetrahedra.  Three  levels 
of  adaptivity  were  applied  to  this  grid  in  order  to  obtain 
a  sharper  resolution  of  the  tip  shock,  yielding  a  final 
mesh  characterized  by  262,556  tetrahedra.  The  pressure 
distributions  obtained  with  the  adapted  grid  are  shown  in 
the third and fourth plots of the same figure. Note that
the smearing present in the first two plots, which is due to the
numerical viscosity introduced in the formulation in order
to stabilize it, has disappeared. Consistent
with  the  nature  of  the  Euler  equations,  the  shocks  appear 
as  jumps  and  are  resolved  in  only  one  or  two  elements. 
Note  also  the  appearance  of  the  analytically  predicted 
overshoot  just  aft  of  the  shock  which  is  typical  of  the 
transonic  Euler  solutions. 

The effect of the adaptation of the mesh on the resolution
of the shock is clearly demonstrated in Figure 63, where
the density isocontour plots at the upper tip surface are
presented for the initial and adapted meshes. The effect
noted in Figure 62 can be more fully appreciated here.
The parallel adaptive analysis was conducted on 32 processing
nodes with the GMRES algorithm, using once
again reduced integration for the interior elements and
full integration at the boundary elements. The symmetric
form of the least-squares stabilization was employed,
together with the discontinuity capturing term for improved
shock confinement. After partitioning of the initial coarse
mesh using the IRB algorithm, the simulation was performed
for 60 implicit time steps with a CFL number
equal to 10 in the initial 20 steps and equal to 15 for
the remaining steps. The results gathered at convergence
were used for computing an error indicator based on density
and Mach number, which was employed for driving
the parallel adaptation of the mesh. For the new vertices
created by the adaptation process, the solution was
projected from the coarser mesh using simple edge
interpolation. The solution obtained in this way was used
for restarting the analysis, which was advanced for 60
time steps with a CFL number of 15. Similarly, a second
adaptation was performed, yielding the final mesh for
which another 40 time steps were performed at a CFL number
of 20, until convergence in the energy norm of the residual.
The average number of GMRES cycles per time
step throughout the analysis was 8.

Figure 62. Computed and experimental pressure
coefficients on the blade, at two different span locations
close to the tip, $\theta_c = 8^\circ$, $M_t = 0.877$. Top two plots:
initial coarse 142,193 tetrahedron grid. Bottom two plots:
adapted (three levels) final 262,556 tetrahedron grid.

Figure 63. Density isocontour plots on the upper
surface of the blade tip, $\theta_c = 8^\circ$, $M_t = 0.877$. At left:
initial coarse grid. At right: final adapted grid.

Figure  64  shows  the  mesh  at  the  upper  face  of  the  blade 
tip,  before  and  after  refinement.  The  different  grey  levels 
indicate  the  different  subdomains,  i.e.  elements  assigned 
to  the  same  processing  node  are  denoted  by  the  same  level 
of grey. Note the change in the shape of the partitions
from the initial to the final mesh, a change generated by the
mesh migration procedure, which rebalances the load after
the refinement procedure has modified the discretization.
Note  also  how  the  mesh  nicely  follows  the  shock. 


Figure 64. Meshes with partitions on the upper surface
of the blade tip, $\theta_c = 8^\circ$, $M_t = 0.877$. At left: initial
coarse grid with IRB partitions. At right: final
adapted grid with partitions obtained by migration.

5.3.  Effectiveness  of  Parallel  Adaptive 
Analysis  Procedures 

The  evaluation  of  the  efficiency  and  performance  of  a 
parallel  adaptive  analysis  is  a  task  complicated  by  the 
numerous  aspects  that  must  be  considered.  In  the  follow¬ 
ing  we  will  try  to  address  at  least  some  of  them  with  the 
help  of  a  classical  problem  in  CFD,  namely  that  of  the 
Onera  M6  wing  in  transonic  flight,  that  we  have  used  in 
the  early  stages  of  development  of  our  code  for  validation 
purposes.  This  wing  has  been  studied  experimentally  by 
Schmitt  and  Charpin  [65]  and  it  has  been  employed  by 
numerous  researchers  for  validating  both  structured  and 
unstructured  flow  solvers.  The  wing  is  characterized  by 
an aspect ratio of 3.8, a leading edge sweep angle of
30°, and a taper ratio of 0.56. The airfoil section is an
Onera D symmetric section with 10% maximum
thickness-to-chord ratio.

We  consider  a  steady  flow  problem  characterized  by  an 
angle of attack $\alpha = 3.06^\circ$ and a value of $M = 0.8395$ for
the  freestream  Mach  number.  In  such  conditions,  the  flow 
pattern  around  the  wing  is  characterized  by  a  complicated 
double-lambda  shock  on  the  upper  surface  of  the  wing 
with  two  triple  points. 

We  first  address  the  scalability  of  the  parallel  solver  on 
a  fixed  mesh,  i.e.  we  analyze  the  speed-ups  attained  by 
the  code  using  one  single  mesh  and  varying  the  num¬ 
ber  of  processing  nodes.  This  is  a  classical  measure  of 
efficiency,  and  it  is  important  to  show  that  the  imple¬ 
mented  procedure  performs  well  with  respect  to  it  before 
measuring  other  properties  that  are  more  pertinent  to  an 
adaptive  analysis. 

The  simulation  was  performed  using  a  mesh  consisting  of 
128,172  tetrahedra,  using  the  matrix-free  GMRES  algo¬ 
rithm  with  reduced  integration  of  the  interior  elements  and 
full  integration  of  the  boundary  elements.  A  local  time 
stepping  strategy  was  employed  with  one  single  Newton 
iteration  per  time  step,  using  a  CFL  condition  of  5  in  the 
first  20  time  steps  and  a  CFL  equal  to  10  for  another  80 
time  steps,  attaining  a  drop  in  the  residual  of  three  orders 
of  magnitude.  The  mesh  was  partitioned  using  a  paral¬ 
lel  implementation  of  the  IRB  algorithm.  The  time  for 
partitioning,  even  if  small  when  compared  with  the  time 
needed  for  achieving  convergence  in  the  finite  element 
analysis,  is  not  considered  in  the  following.  The  analysis 
was  run  on  4,  8,  16,  32,  64,  128  processors  of  an  IBM 
SP-2  and  the  results  are  presented  in  Figure  65  in  terms 
of  the  inverse  of  the  wall  clock  time  versus  the  number 
of  processing  nodes.  The  highly  linear  behavior  of  the 
parallel  algorithm  shows  the  excellent  characteristics  of 
scalability  of  the  code. 
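
For readers who wish to reproduce this kind of fixed-mesh scaling plot from their own timings, the quantities involved can be computed as below. The wall clock times in the example are made up (perfectly linear by construction) and are not the measurements behind Figure 65; the function name `scaling_table` is likewise an assumption.

```python
def scaling_table(procs, wall_times):
    """Return (procs, 1/time, speed-up, efficiency) relative to the first entry."""
    base_p, base_t = procs[0], wall_times[0]
    rows = []
    for p, t in zip(procs, wall_times):
        speedup = base_t / t                  # relative to the smallest run
        efficiency = speedup / (p / base_p)   # 1.0 means ideal scaling
        rows.append((p, 1.0 / t, speedup, efficiency))
    return rows


procs = [4, 8, 16, 32, 64, 128]
wall_times = [1000.0 / p for p in procs]      # hypothetical timings, in seconds

rows = scaling_table(procs, wall_times)
for p, inv_t, speedup, eff in rows:
    print(f"{p:4d}  {inv_t:8.4f}  {speedup:6.2f}  {eff:5.2f}")
```

A straight line in the plot of inverse wall clock time against the number of processing nodes corresponds to an efficiency that stays close to one.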



Figure  65.  Parallel  efficiency  evaluated  at 
fixed  mesh  for  the  Onera  M6  wing  in  transonic 
flight.  128,172  tetrahedra,  IRB  partitions. 

The  same  problem  was  then  adaptively  solved  in  order  to 
more  accurately  resolve  the  complicated  features  of  the 
flow. An initial coarse mesh of 85,567 tetrahedra was partitioned
with the IRB algorithm on 32 processing nodes
and the analysis was carried on to convergence as previously
explained. The results obtained were then used for
computing  an  error  indicator  based  on  density  and  Mach 
number,  which  was  employed  for  performing  a  first  level 
of  refinement,  bringing  the  mesh  to  131,000  tetrahedra. 
The  solution  was  projected  on  the  new  vertices  using  a 
simple  edge  interpolation  technique,  and  the  analysis  was 
then  performed  on  the  refined  mesh  for  80  time  steps  at  a 
CFL number of 10. Similarly, two further levels of refinement,
each followed by a subsequent analysis, were performed,
obtaining  an  intermediate  223,499  tetrahedron  mesh  and 
a  final  388,837  tetrahedron  mesh. 

Figure  66  shows  the  density  isocontour  plots  on  the  upper 
surface  of  the  wing  corresponding  to  the  initial  and  the 
final  mesh  discretizations.  Note  that  the  forward  shock 
is  barely  visible  in  the  results  obtained  with  the  initial 
coarse  mesh,  the  aft  shock  presents  significant  smearing 
and  the  lambda  shock  located  at  the  tip  of  the  wing  is 
not  resolved.  As  expected,  considerable  improvement  in 
the  resolution  of  the  shocks  can  be  observed  when  mesh 
adaptation  is  employed. 

Figure  67  shows  the  initial  and  final  meshes.  Once  again, 
elements  assigned  to  the  same  subdomains  are  denoted 
by  the  same  grey  level.  For  the  final  mesh,  the  partitions 
shown  are  those  obtained  with  the  iterative  load  balancing 
algorithm. 

The  fact  that  the  analysis  is  conducted  in  parallel  doesn’t 
modify  the  convergence  characteristics  of  a  classical  h 
refinement  technique,  such  as  the  one  considered  here. 
However,  while  in  a  serial  environment  essentially  only 
the  accuracy  of  the  solution  versus  the  size  of  the  prob¬ 
lem  and  its  computational  cost  enter  into  the  picture, 
in  a  parallel  environment  other  factors  must  be  consid¬ 
ered.  In  particular,  we  consider  here  the  evolution  dur¬ 
ing  the  analysis  of  two  fundamental  parameters:  (i)  the 
surface-to-volume  ratio  for  the  subdomains,  and  (ii)  the 
number  of  neighbors  of  each  subdomain.  The  first  of 
these  two  parameters  essentially  dominates  the  volume 
of  communication  in  terms  of  the  size  of  the  messages 
to  exchange,  while  the  second  parameter  dominates  the 
number  of  messages  that  each  processor  must  send  and 
receive. 

In a parallel adaptive environment, the issue is then:
given certain repartitioning algorithms, what is the quality
of the partitions that they produce compared to their
relative cost? It is well known that certain classes of
partitioning  algorithms,  such  as  the  Spectral  Bisection 
method,  produce  very  high  quality  partitions.  However, 
the  cost  associated  with  spectrally  bisecting  increasingly 
larger  meshes  during  an  adaptive  analysis  would  be  pro¬ 
hibitive.  Therefore  in  this  work  we  consider  two  rela¬ 
tively  low  cost  approaches  to  the  problem,  the  previously 
mentioned  parallel  IRB  repartitioning  and  the  iterative 
load  migration  scheme. 

Two distinct runs were made, the only difference between
them being the repartitioning strategy adopted. In both
cases, all the stages of the analysis (initial IRB partitioning,
flow solution, error sensing, adaptation and load
balancing) were performed automatically in parallel on
32 processing nodes, i.e. without ever leaving the parallel
environment. The load balancing algorithm was activated
three times during the adaptation of each of the meshes:
after the refinement, after the snapping of the newly generated
vertices to the curved boundaries of the model, and
after the local retriangulation*. At every call, the algorithm
was requested to perform only approximately eight
migration iterations, yielding a maximum out of balance
number of elements per processing node equal to one at
the end of each refinement level. This strategy allows
better efficiency of the various stages of the adaptive
algorithm, which can then operate on balanced or nearly
balanced meshes. This “incremental” rebalancing capability
represents a clear advantage of the iterative load balancing
scheme over other algorithms. The parallel repartitioning
algorithm was instead activated just once at the end of
each adaptive step.

* We remark that in the current implementation, snapping
can also cause load imbalance since it makes use of
local triangulation.

Figure 66. Onera M6 wing in transonic
flight, $\alpha = 3.06^\circ$, $M = 0.8395$. Density
isocontour plots for the initial and final meshes.

Figure 67. Onera M6 wing in transonic flight,
$\alpha = 3.06^\circ$, $M = 0.8395$. Initial and final meshes.
Grey levels indicate processor assignment.

The  meshes  obtained  during  the  two  previously  men¬ 
tioned  parallel  adaptive  simulations  of  the  Onera  M6  wing 
were  analyzed  for  gathering  data  on  the  overall  perfor¬ 
mance  of  the  analysis.  Figure  68  reports  plots  of  the 
boundary  faces  and  neighbor  statistics.  The  quantities 
plotted  are  defined  as: 

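(i) Surface-to-volume measures:

$$S_{\max} = \max_i \left( \text{Boundary Faces}_i / \text{Faces}_i \right),$$

$$S_{\text{glob}} = \text{Boundary Faces} / \text{Faces}.$$

(ii) Neighbor measures:

$$N_{\max} = \max_i \left( \text{Neighbors}_i / (\text{Procs} - 1) \right),$$

$$N_{\text{avg}} = \sum_i \left( \text{Neighbors}_i / (\text{Procs} - 1) \right) / \text{Procs}.$$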

All  these  quantities  are  reported  in  Figure  68  versus  the 
number  of  tetrahedra  in  the  mesh  at  a  certain  adaptive 
level  normalized  by  the  number  of  tetrahedra  in  the  initial 
mesh.  The  solid  line  represents  the  values  of  the  parame¬ 
ters  obtained  for  the  parallel  adaptive  analysis  where  the 
iterative  mesh  migration  procedures  were  employed.  The 
dashed  line  corresponds  to  the  parallel  adaptive  analysis 
where  the  refined  meshes  were  repartitioned  after  each 
adaptive  step  using  the  parallel  IRB  algorithm. 
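
The four partition measures are simple to evaluate once per-subdomain counts are available. The sketch below uses made-up counts for four subdomains; the function name and the convention that the global face counts are supplied by the caller are assumptions, since the text does not say how shared faces are counted in the global ratio.

```python
def partition_statistics(boundary_faces, faces, neighbors,
                         global_boundary_faces, global_faces):
    """Surface-to-volume and neighbor measures of a partition.

    boundary_faces[i], faces[i] and neighbors[i] refer to subdomain i.
    """
    procs = len(faces)
    s_max = max(b / f for b, f in zip(boundary_faces, faces))
    s_glob = global_boundary_faces / global_faces
    neighbor_fraction = [n / (procs - 1) for n in neighbors]
    n_max = max(neighbor_fraction)
    n_avg = sum(neighbor_fraction) / procs
    return s_max, s_glob, n_max, n_avg


# Hypothetical counts for a four-subdomain partition.
s_max, s_glob, n_max, n_avg = partition_statistics(
    boundary_faces=[10, 20, 15, 5],
    faces=[100, 100, 100, 100],
    neighbors=[1, 3, 2, 2],
    global_boundary_faces=25,
    global_faces=375,
)
print(s_max, s_glob, n_max, n_avg)
```

Both neighbor measures are normalized by the largest possible number of neighbors, so a value of one means that a subdomain touches every other subdomain.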

Figure  68.  Boundary  faces  and  neighbor  statistics  for
the  parallel-adaptive  analysis  of  the  Onera 
M6  wing  in  transonic  flight  using  the  mesh 
migration  and  IRB  rebalancing  schemes. 

From  the  analysis  of  the  first  two  plots  at  the  top  of 
Figure 68, it is clear that the migration procedures implemented
in this work control the surface-to-volume ratios very
effectively; these in fact remain constant and
fairly similar to the ones obtained with the IRB partitioning
for the whole simulation. On the other hand, the
second  two  plots  of  the  same  figure  show  that  the  num¬ 
ber  of  neighbors  of  each  subdomain  tends  to  increase  with 
the  number  of  adaptive  steps  performed.  A  more  detailed 
analysis  shows  that  in  general  each  subdomain  is  con¬ 
nected  by  a  significant  amount  of  mesh  entities  (vertices, 
faces,  edges)  only  with  a  reduced  number  of  neighbors, 
while  it  shares  a  very  limited  number  of  mesh  entities 
with  the  other  neighbors.  We  are  currently  investigating 
ways  of  removing  such  small  contact  area  interconnec¬ 
tions,  in  order  to  achieve  a  better  control  on  the  number 
of  neighbors. 

The different partition statistics provided by the two
rebalancing algorithms and shown in the previous figure
clearly have an impact on the performance of the flow
solver. For example, the ratio of the wall clock timings
for the flow solutions performed on the final adapted
mesh was found to be 0.83, in favor of the repartitioning
algorithm. It should be pointed out that this is not an
objective measure of efficiency of the rebalancing strategy,
in the sense that it depends on the algorithm used for the
flow solution. On the contrary, $S_{\max}$, $S_{\text{glob}}$, $N_{\max}$ and
$N_{\text{avg}}$ are objective measures.

The  two  approaches  were  also  compared  in  terms  of  rela¬ 
tive  wall  clock  timing  cost.  The  repartitioning  algorithm 
outperformed  the  migration  scheme  at  each  adaptive  step. 
The  ratio  of  the  iterative  migration  wall  clock  time  to  the 
rebalancing  wall  clock  time  was  found  to  be  4.07  at  the 
first level (131,000 tetrahedron mesh), 4.41 at the second
(223,499 tetrahedron mesh) and 2.21 at the third
(388,837 tetrahedron mesh).

These  preliminary  test  results  seem  to  indicate  that  the 
iterative  load  migration  scheme  tends  to  be  more  compu¬ 
tationally  expensive  than  the  parallel  IRB  algorithm,  and 
at  the  same  time  does  not  yield  the  same  quality  of  the 
partitions,  at  least  with  the  currently  implemented  heuris¬ 
tics.  However,  it  must  not  be  forgotten  that  these  tests  are 
certainly  not  as  exhaustive  as  one  might  desire  for  ruling 
in  favor  of  one  approach  over  the  other.  Moreover,  it  is 
clear  that  this  result  is  partially  due  to  the  low  cost  of  the 
IRB  partitioning,  and  comparing  the  migration  scheme 
with  other  more  expensive  partitioning  algorithms  might 
lead  to  opposite  conclusions.  For  example,  if  an  algo¬ 
rithm  with  better  control  over  the  number  of  neighbors 
could  be  devised,  then  the  migration  scheme  used  in  con¬ 
junction  with  a  high  quality  initial  partition  (such  as  the 
one  provided  by  a  spectral  partitioning)  could  yield  an 
overall  better  performance  than  a  repartitioning  scheme. 
A  more  complete  analysis  of  the  relative  merits  of  the 
two  approaches  will  be  the  subject  of  future  work. 

6.  Closing  Remarks 

This  paper  has  presented  progress  made  to  date  on  the 
development  of  parallel  automated  adaptive  analysis  pro¬ 
cedures  for  unstructured  meshes  which  operate  on  dis¬ 
tributed  memory  MIMD  computers.  The  procedures  pre¬ 
sented  allow  for  the  reliable  analysis,  through  the  use 
of  automated  adaptive  analysis,  of  large  problems  which 
can  only  be  supported  by  the  computational  power  of 
parallel  computers.  Specific  emphasis  was  placed  on  the 
techniques  needed  to  effectively  support  evolving  meshes 
such  that  computational  load  balance  was  maintained 
throughout  the  simulation  process. 

Introduction  and  Overview

The  intent  of  these  notes  is  to  review  several 
basic  algorithms  and  procedures  used  in  com¬ 
putational  fluid  dynamics  (CFD)  with  emphasis 
on  algorithms  suitable  to  parallel  computing  en¬ 
vironments.  In  particular,  we  will  concentrate 
on  numerical  methods  in  CFD  which  require  the 
formation  and  solution  of  large  sparse  linear  sys¬ 
tems  of  algebraic  equations.  These  matrices  will 
arise  from  the  discretization  of  the  Navier-Stokes 
equations  which  govern  compressible  fluid  flow. 
From  this  point  of  view,  a  large  portion  of  these 
notes  addresses  algorithms  used  in  the  forma¬ 
tion,  manipulation,  and  solution  of  sparse  ma¬ 
trices  on  serial  and  parallel  computers. 

Chapter  1  begins  by  considering  the  task  of  or¬ 
dering  (numbering)  vertices  of  an  unstructured 
mesh.  Good  vertex  orderings  can  greatly  im¬ 
prove  the  efficiency  and  memory  storage  required 
in  many  sparse  matrix  algorithms.  For  example, 
techniques  for  iterative  matrix  solution  some¬ 
times  exploit  incomplete  matrix  factorizations. 
The  quality  of  these  factorizations  usually  de¬ 
pends  on  the  ordering  of  matrix  unknowns  and 
consequently  mesh  vertices.  Next,  we  review  the 
mesh  partitioning  problem.  Three  simple  proce¬ 
dures  for  decomposing  an  arbitrary  triangulated 
domain  into  a  specified  number  of  subdomains 
are  discussed.  Each  subdomain  may  then  be 
placed  on  an  individual  processor  of  the  paral¬ 
lel  computer.  Communication  between  proces¬ 
sors  is  accomplished  using  message  packet  ex¬ 
changes.  This  computational  model  places  de¬ 
mands  on  the  partitioning  algorithms  so  that 
computational  work  is  evenly  distributed  (bal¬ 
anced)  while  requiring  minimal  communication 
among  processors. 
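
The two demands placed on a partitioning algorithm, balanced work and little communication, can be quantified with very little code. The following sketch is illustrative only: it measures work by vertex count and communication by the number of cut edges, which are common but simplified proxies, and the function name is an assumption.

```python
def partition_quality(edges, part, n_parts):
    """Return subdomain sizes, number of cut edges and load imbalance.

    part[v] is the subdomain (processor) index assigned to vertex v.
    """
    sizes = [0] * n_parts
    for p in part:
        sizes[p] += 1
    # An edge is cut when its end points live on different processors.
    cut = sum(1 for a, b in edges if part[a] != part[b])
    # 1.0 means perfectly balanced; larger values mean more imbalance.
    imbalance = max(sizes) / (len(part) / n_parts)
    return sizes, cut, imbalance


# Six vertices connected in a chain, split in two different ways.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
contiguous = partition_quality(edges, [0, 0, 0, 1, 1, 1], 2)
interleaved = partition_quality(edges, [0, 1, 0, 1, 0, 1], 2)
print(contiguous, interleaved)
```

Both partitions are perfectly balanced, but the interleaved one cuts every edge and would therefore require far more message traffic.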

In  Chapter  2  we  turn  to  the  compressible 
Navier-Stokes  equations.  These  equations  repre¬ 
sent  conservation  principles  for  mass,  momenta, 
and energy of a Newtonian fluid. In high
speed aerodynamic applications, the effects of
turbulence are very important and must either
be accurately computed or approximately modeled.
This increases the difficulty and complexity
of solving the Navier-Stokes equations. In
the present applications, a one-equation turbulence
model equation is added to the basic time-averaged
Navier-Stokes equations. The resulting
system of coupled integral equations is discretized
using a finite-volume technique based on
linear least squares reconstruction. This yields a
system of nonlinear coupled algebraic equations
which are solved via Newton iteration. The most
difficult  task  in  Newton’s  method  is  the  solution 
of  the  resulting  sequence  of  large  sparse  linear 
matrix  problems.  Iterative  methods  based  on 
preconditioned  bi-conjugate  gradient  and  gener¬ 
alized  minimum  residual  iterations  are  consid¬ 
ered.  Numerical  examples  are  then  shown  to 
demonstrate  the  convergence  characteristics  of 
the  uniprocessor  algorithm. 

Chapter 3 focuses on domain decomposed variants
of the uniprocessor CFD algorithm given in
Chapter 2. As a starting point, the Schwarz domain
decomposition algorithm for elliptic equations
is reviewed. This technique requires the
isolated solution of subdomain problems. Next
we derive the well-known relationship between
convergence rate of the Schwarz algorithm and
overlap of subdomains. This analysis reveals
that special care must be taken to ensure that
the domain decomposition procedure does not
become ill-conditioned as the number of subdomains
is increased. The Schwarz algorithm
can also be applied to the solution of nonelliptic
equations. Computations of inviscid and viscous
fluid flow are shown to demonstrate the favorable
effect of increasing subdomain overlap on convergence
of the Schwarz algorithm. In these computations,
each subdomain is independently solved
using the Newton algorithm given in Chapter
2. An alternative to the conventional domain
decomposition procedure is the Newton-Krylov
technique with the overlapping Schwarz method
used to precondition the underlying global matrix
problems. Inviscid and viscous computations
are shown to demonstrate the efficiency of
this method.

Finally,  Chapter  4  presents  some  selected  com¬ 
putations  performed  on  the  IBM  SP2  parallel 
computer  located  at  NASA  Ames. 



Chapter  1 

Graph  Ordering  and  Partitioning 
Algorithms  for  CFD 


In  this  section  we  review  a  few  basic  graph  al¬ 
gorithms  which  are  frequently  used  in  numerical 
computations  performed  on  parallel  computers. 

1.1  Graph  Ordering 

The particular ordering of mesh (graph) vertices
can sometimes alter the amount of computational
effort and memory storage required in
solving sparse matrix problems. In sparse matrix
L-U factorization, the number of fill elements
produced during factorization is dependent on
the ordering of equations. Good ordering algorithms
attempt to reduce the number of fill elements
produced during factorization. Similarly,
the quality of inexact factorizations such as incomplete
Cholesky and incomplete L-U factorization
also depends on the ordering of matrix
unknowns. Reordering vertices can also lead to
improved processor efficiency by reducing “cache
misses”, an important consideration for computations
performed on workstation class computers.
In parallel computation, ordering algorithms
can be used as a means for partitioning a mesh
among processors of the computer. This last
consideration will be addressed in a later section.

In this section we review the Cuthill-McKee
[CM69] ordering algorithm. This popular algorithm
is simple yet surprisingly effective. Other
popular ordering strategies which deserve attention
but are not discussed here include the minimum
degree algorithm [GL81] and Rosen’s algorithm
[Ros68] for bandwidth reduction. We begin
a discussion of the Cuthill-McKee algorithm
by simply stating the procedure.

Algorithm: Graph ordering, Cuthill-McKee.

Step 1. Find the vertex with lowest degree. This is the root vertex.

Step 2. Find all neighboring vertices connected to the root by incident
edges. Order them by increasing vertex degree. This forms level 1.

Step 3. Form level k by finding all neighboring vertices of level k - 1
which have not been previously ordered. Order these new vertices by
increasing vertex degree.

Step 4. If vertices remain, go to step 3.
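The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the adjacency-dictionary input format and the function names `cuthill_mckee`, `reverse_cuthill_mckee` and `bandwidth` are assumptions made for this example, ties in vertex degree are broken by vertex label, and the search is restarted from the lowest-degree unvisited vertex so that disconnected graphs are also handled.

```python
def cuthill_mckee(adj):
    """Level-by-level Cuthill-McKee ordering.

    adj: dict mapping each vertex to the set of its neighbours
         (hypothetical input format chosen for this sketch).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    order, seen = [], set()
    # Step 1: start from the lowest-degree vertex (restart for each component).
    for root in sorted(adj, key=lambda v: (degree[v], v)):
        if root in seen:
            continue
        seen.add(root)
        level = [root]
        while level:
            order.extend(level)
            # Steps 2-3: unordered neighbours of the current level form the
            # next level, sorted by increasing vertex degree.
            nxt = {w for v in level for w in adj[v]} - seen
            seen |= nxt
            level = sorted(nxt, key=lambda w: (degree[w], w))
    return order


def reverse_cuthill_mckee(adj):
    """Reversed ordering, i.e. position k becomes position n - k + 1."""
    return cuthill_mckee(adj)[::-1]


def bandwidth(adj, order):
    """Largest index distance between two adjacent vertices under 'order'."""
    pos = {v: i for i, v in enumerate(order)}
    return max((abs(pos[v] - pos[w]) for v in adj for w in adj[v]), default=0)


# A path graph 0-3-1-4-2 whose natural numbering is scrambled.
path = {0: {3}, 3: {0, 1}, 1: {3, 4}, 4: {1, 2}, 2: {4}}
cm_order = cuthill_mckee(path)
```

For this small path graph the natural numbering 0, 1, 2, 3, 4 has bandwidth 3, while the Cuthill-McKee ordering recovers the path order and reduces the bandwidth to 1. Reversing the ordering leaves the bandwidth unchanged.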

The heuristics behind the Cuthill-McKee algorithm are very simple. In the
graph of a matrix, neighboring vertices must have numberings which are
nearby, otherwise they will produce entries in the matrix with large
bandwidth. The idea of sorting elements within a given level is based on
the heuristic that vertices with high degree should be given indices as
large as possible so that they will be as close as possible to vertices of
the next level generated. Figures 1.1 and 1.2 show the dramatic improvement
in matrix bandwidth achieved using the Cuthill-McKee algorithm.

Studies of the Cuthill-McKee algorithm have shown that the fill
characteristics of a matrix during LU decomposition can be greatly reduced
by reversing the ordering of the Cuthill-McKee algorithm, see George
[Geo71]. This amounts to a renumbering given by

\[ k \rightarrow n - k + 1 \]

where n is the size of the matrix. While this does not change the bandwidth
of the matrix, it can dramatically reduce the fill that occurs in Cholesky
or LU matrix factorization when compared to the original Cuthill-McKee
ordering.

Figure 1.1: Nonzero matrix elements produced by a Laplacian discretization
(left) on the triangulated domain shown in Figure 1.4.

Figure 1.2: Nonzero matrix elements after Cuthill-McKee reordering (right).

1.2 Graph Bisection and Mesh Partitioning

An efficient partitioning of a mesh for distributed memory computation is
one that ensures an even distribution of computational workload among the
processors and minimizes the amount of time spent in interprocessor
communications. The former requirement is termed load balancing. If the
load is not evenly distributed, some processors will have to sit idle at
synchronization points waiting for other processors to catch up.

The second requirement comes from the fact that communication between
processors takes time and it is not always possible to hide this latency
in data transfer. The actual cost of communication can often be accurately
modeled by the linear relationship:

\[ \mathrm{Cost} = \alpha + \beta m \]

where $\alpha$ is the time required to initiate a message, $\beta$ is the
transfer time per unit of data between two processors (the reciprocal of
the data-transfer rate) and m is the message length. For n messages, the
cost would be

\[ \mathrm{Cost} = \sum_{n} (\alpha + \beta m_n). \]

This cost can be reduced in two ways: (1) reduce the number of messages n,
(2) reduce the size of each message m. Consider the partitioning shown in
Figure 1.3. The left figure requires 3 pairwise communication messages of
length 5 while the right figure requires 4 pairwise messages of length 2
and 2 pairwise messages of length 1. The choice of partitioning depends
critically on the hardware dependent constants $\alpha$ and $\beta$.

Figure 1.3: (a) Mesh partitioning with minimized number of messages,
(b) Mesh with minimized message length.

In practice, it is difficult to partition an unstructured mesh while
simultaneously minimizing the number and length of messages. In the
following paragraphs, a few of the most popular partitioning algorithms
which approximately accomplish this task will be discussed. All the
algorithms discussed below (coordinate bisection, Cuthill-McKee, and
spectral partitioning) are discussed in the paper by Venkatakrishnan,
Simon, and Barth [VSB92]. This paper evaluates the partitioning techniques
within the confines of an explicit, unstructured finite-volume Euler
solver. Spectral partitioning has been extensively studied by Simon
[Sim91] for other applications. Although we restrict our discussion to
partitioning planar triangulations, all of the algorithms discussed below
extend naturally to arbitrary cell complexes and higher space dimensions.
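The linear communication cost model introduced earlier in this section can be evaluated directly for the two partitionings of Figure 1.3. The short sketch below is only an illustration: the function name `communication_cost` and the numerical values chosen for the start-up time and per-unit transfer time are assumptions, not measurements from any machine.

```python
def communication_cost(message_lengths, alpha, beta):
    """Total cost of a set of messages under Cost = sum(alpha + beta * m)."""
    return sum(alpha + beta * m for m in message_lengths)


# Message lengths read off Figure 1.3 as described in the text.
few_long_messages = [5, 5, 5]             # 3 messages of length 5
many_short_messages = [2, 2, 2, 2, 1, 1]  # 4 of length 2, 2 of length 1

# Hypothetical hardware constants: start-up dominated versus transfer dominated.
latency_bound = dict(alpha=10, beta=1)
bandwidth_bound = dict(alpha=1, beta=1)

cost_a_latency = communication_cost(few_long_messages, **latency_bound)
cost_b_latency = communication_cost(many_short_messages, **latency_bound)
cost_a_bandwidth = communication_cost(few_long_messages, **bandwidth_bound)
cost_b_bandwidth = communication_cost(many_short_messages, **bandwidth_bound)
```

The two totals are 3α + 15β and 6α + 10β, so the partitioning with fewer messages is cheaper whenever 3α > 5β, and the partitioning with shorter messages is cheaper otherwise. This is the sense in which the choice depends on the hardware constants.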

In the following sections, we consider mesh partitioning via recursive
application of graph bisection. The mesh is first divided into two
sub-meshes of nearly equal size. Each of these sub-meshes is subdivided
into two more sub-meshes and the process is repeated until the desired
number of partitions p is obtained (p is an integer power of 2). In many
applications it makes sense to partition mesh cells such that partition
boundaries correspond to edges of the mesh. This can be viewed as a vertex
partitioning of the graph dual to the cell complex, see for example
Figures 1.4 and 1.5. In this way, dual graph vertices are associated with
mesh cells.


Figure  1.4:  Typical  triangulation  for  a  square¬ 
shaped  domain. 


Figure  1.5:  Geometric  dual  of  previous  triangu¬ 
lation  for  a  square-shaped  domain. 

1.2.1  Recursive  Coordinate  Bisection 

In the coordinate bisection algorithm, graph vertex coordinates are sorted
either horizontally or vertically depending on the current level of the
recursion. A separator is chosen which balances the number of vertices.
Vertices are then 2-colored depending on which side of the separator they
are located.

Figure 1.6 shows the recursive coordinate bisection of a multi-element
airfoil geometry. In this example, the dual graph of the triangulation has
been used for partitioning with dual graph vertices assigned the centroid
coordinates of cells in the triangulation plane. The recursive coordinate
partitioning is very efficient to create but gives sub-optimal performance
on parallel computations owing to the long message lengths that can
routinely occur.

Figure 1.6: Recursive coordinate bisection partitioning of multi-element
airfoil mesh.
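Recursive coordinate bisection as described in this subsection can be sketched with NumPy as follows. The function name, the choice to alternate the sort axis with the recursion level, and the way partition labels are built from one bit per level are assumptions of this example. The input `points` stands for the vertex (or cell centroid) coordinates.

```python
import numpy as np


def recursive_coordinate_bisection(points, levels):
    """Assign each point to one of 2**levels partitions.

    points: (n, d) array of coordinates; levels: number of bisection levels.
    """
    points = np.asarray(points, dtype=float)
    part = np.zeros(len(points), dtype=int)

    def split(idx, level):
        if level == levels or len(idx) < 2:
            return
        axis = level % points.shape[1]  # alternate x, y, ... per level
        order = idx[np.argsort(points[idx, axis], kind="stable")]
        half = len(order) // 2          # separator balancing the vertex count
        lower, upper = order[:half], order[half:]
        part[upper] += 1 << (levels - level - 1)  # one label bit per level
        split(lower, level + 1)
        split(upper, level + 1)

    split(np.arange(len(points)), 0)
    return part


# A 4 x 4 lattice of points split into 4 partitions (2 levels).
grid = [(i, j) for i in range(4) for j in range(4)]
labels = recursive_coordinate_bisection(grid, levels=2)
```

On this lattice each of the four partitions receives four points, corresponding to the four quadrants of the square.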

1.2.2  Recursive  Cuthill-McKee  Bisec¬ 
tion 

The Cuthill-McKee algorithm described earlier can also be used for
recursive mesh partitioning. In this case, the Cuthill-McKee level
structure is used to 2-color vertices of the graph. A separator is chosen
either at the median of the level structure ordering or at the level set
boundary closest to the median. This latter technique has the desired
effect of reducing the number of disconnected sub-graphs that occur during
the recursive partitioning. Figure 1.7 shows a Cuthill-McKee partitioning
for the multi-element airfoil mesh. The Cuthill-McKee ordering tends to
produce long boundaries because of the way that the ordering is propagated
through a mesh. The number of communication messages required to exchange
boundary information tends to be higher using the Cuthill-McKee algorithm
when compared to the coordinate bisection algorithm. The results shown in
[VSB92] for multi-element airfoil grids indicate an overall performance on
parallel computations which is slightly worse than the coordinate
bisection technique.

Figure 1.7: Recursive Cuthill-McKee bisection partitioning of
multi-element airfoil mesh.


1.2.3  Recursive  Spectral  Bisection 

The last partitioning algorithm considered is the spectral bisection
algorithm [PSL90] [Sim91] [VSB92] [BS93] [HL95]. This algorithm determines
a 2-color bisection of a graph such that the cut-weight, $W_c$, is
approximately minimized. The cut-weight of a graph is defined as the sum
of edge weights for all edges with vertices of disjoint color. For
simplicity, we will consider unweighted (unit edge weight) graphs. The
problem of minimizing the cut-weight of a graph subject to the constraint
that the number of vertices is balanced is related to a simpler problem in
graph bisection which is known to be NP-hard [GJS76]. The spectral
bisection algorithm can be seen as an algorithm for approximately solving
this NP-hard combinatorial problem by solving a continuous (hopefully
nearby) problem. The algorithm consists of the following steps:

Algorithm: Spectral Graph Bisection.

Step 1. Calculate the matrix $\mathcal{L}$ associated with the Laplacian
of the graph.

Step 2. Calculate the eigenvalues and eigenvectors of $\mathcal{L}$.

Step 3. Order the eigenvalues by magnitude,
$\lambda_1 \le \lambda_2 \le \lambda_3 \le \dots \le \lambda_n$.

Step 4. Determine the smallest nonzero eigenvalue, $\lambda_f$, and its
associated eigenvector $x_f$ (the Fiedler vector).

Step 5. Sort elements of the Fiedler vector.

Step 6. Choose a divisor at the median of the sorted list and 2-color
vertices of the graph which correspond to elements of the Fiedler vector
less than or greater than the median value.
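The six steps above can be sketched with a dense eigensolver. This is a small illustration under stated assumptions: the graph is connected and unweighted, the Laplacian is formed as degree matrix minus adjacency matrix, `numpy.linalg.eigh` replaces the Lanczos iteration that would be used on a large mesh, and all function names are hypothetical.

```python
import numpy as np


def graph_laplacian(n, edges):
    """Dense Laplacian (degree minus adjacency) of an unweighted graph."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L


def spectral_bisection(n, edges):
    """Return a +1/-1 coloring and the Fiedler eigenvalue (connected graph)."""
    L = graph_laplacian(n, edges)          # Step 1
    eigvals, eigvecs = np.linalg.eigh(L)   # Steps 2-3: ascending eigenvalues
    fiedler = eigvecs[:, 1]                # Step 4: second eigenvector
    order = np.argsort(fiedler, kind="stable")  # Step 5
    color = np.ones(n, dtype=int)          # Step 6: split at the median
    color[order[: n // 2]] = -1
    return color, eigvals[1]


def cut_weight(edges, color):
    """Number of edges whose end points have different colors."""
    return sum(1 for i, j in edges if color[i] != color[j])


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
color, lam2 = spectral_bisection(6, edges)
cut = cut_weight(edges, color)
```

For this graph the Fiedler vector separates the two triangles, so only the bridge edge is cut. The overall sign of an eigenvector is arbitrary, so the two colors may be swapped from run to run without changing the partition.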


Figure  1.8:  Recursive  spectral  bisection  parti¬ 
tioning  of  multi-element  airfoil  mesh. 


The spectral partitioning of the multi-element airfoil is shown in Figure
1.8. In [VSB92] it was observed that superior performance was attained for
parallel flow field computations using spectral partitioning. The cost of
the spectral partitioning is high even using a Lanczos algorithm to solve
the eigenvalue problem. Recently, this cost has been reduced by the use of
a multilevel Lanczos algorithm as discussed in [BS93].

The spectral partitioning exploits a peculiar property of the "second"
eigenvector of the Laplacian matrix associated with a graph. Consider a
graph $G = (V, E)$ consisting of n vertices and m edges. The Laplacian
matrix of the graph, $\mathcal{L} \in \mathbb{R}^{n \times n}$, is given by

\[ \mathcal{L} = \mathcal{D} - \mathcal{A} \]

where $\mathcal{A} \in \mathbb{R}^{n \times n}$ is the standard adjacency
matrix

\[ \mathcal{A}_{ij} = \begin{cases} 1 & e(v_i, v_j) \in G \\ 0 & \text{otherwise} \end{cases} \tag{1.1} \]

and $\mathcal{D}$ is an $n \times n$ diagonal matrix with entries equal to
the degree of each vertex, $\mathcal{D}_{ii} = d(v_i)$. Alternatively, the
Laplacian of a graph can be written in terms of the rectangular incidence
matrix $\mathcal{C} \in \mathbb{R}^{n \times m}$

\[ \mathcal{C}_{il} = \begin{cases} -1 & \text{if } v_i \text{ is the origin of edge } l \\ \phantom{-}1 & \text{if } v_i \text{ is the destination of edge } l \\ \phantom{-}0 & \text{otherwise} \end{cases} \tag{1.2} \]

Using the incidence matrix, the Laplacian of the graph is given by

\[ \mathcal{L} = \mathcal{C}\mathcal{C}^T \tag{1.3} \]

Multiplication of $\mathcal{C}^T$ times a vector $x \in \mathbb{R}^n$ is
equivalent to differencing vertex values of x across each edge so that

\[ \sum_{e(v_i, v_j) \in E} (x_j - x_i)^2 = x^T \mathcal{C}\mathcal{C}^T x = x^T \mathcal{L} x \tag{1.4} \]

This provides an easy way to verify the symmetry and positive
semi-definiteness of $\mathcal{L}$. Also from the above definitions, it
should be clear that rows of $\mathcal{L}$ each sum to zero. Define the
summation vector $s \in \mathbb{R}^n$, $s = [1, 1, 1, \dots]^T$. By
construction we have that $\mathcal{L}s = 0$. This means that at least one
eigenvalue is zero with s as an eigenvector. To understand the spectral
bisection algorithm, define a partitioning vector $p \in \mathbb{R}^n$
which 2-colors the vertices of a graph

\[ p = [+1, -1, -1, +1, +1, \dots, +1, -1]^T \tag{1.5} \]

depending on the sign of elements of p and the one-to-one correspondence
with vertices of the graph, see for example Figure 1.9. A critical
observation is that the cut-weight can be expressed in terms of the
partitioning vector p and the Laplacian of the graph by the following
easily verified formula

\[ W_c = \tfrac{1}{4}\, p^T \mathcal{L} p. \tag{1.6} \]
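The identities relating the Laplacian, the incidence matrix and the cut-weight can be checked numerically on a small graph. The sketch below is only a verification aid: the example graph, the helper names and the convention of forming the Laplacian as degree matrix minus adjacency matrix (so that it is positive semi-definite) are assumptions of this example.

```python
import numpy as np


def adjacency_and_incidence(n, edges):
    """Adjacency matrix A (n x n) and incidence matrix C (n x m)."""
    A = np.zeros((n, n))
    C = np.zeros((n, len(edges)))
    for l, (i, j) in enumerate(edges):
        A[i, j] = A[j, i] = 1.0
        C[i, l] = -1.0  # v_i is the origin of edge l
        C[j, l] = 1.0   # v_j is the destination of edge l
    return A, C


def graph_laplacian(A):
    """Laplacian as diagonal degree matrix minus adjacency matrix."""
    return np.diag(A.sum(axis=1)) - A


def cut_edges(edges, p):
    """Count edges whose end points carry different colors."""
    return sum(1 for i, j in edges if p[i] != p[j])


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A, C = adjacency_and_incidence(6, edges)
L = graph_laplacian(A)
s = np.ones(6)

p_good = np.array([1, 1, 1, -1, -1, -1])   # cuts only edge (2, 3)
p_poor = np.array([1, -1, 1, -1, 1, -1])   # cuts five edges

wc_good = 0.25 * p_good @ L @ p_good
wc_poor = 0.25 * p_poor @ L @ p_poor
```

Each cut edge contributes $(p_j - p_i)^2 = 4$ to the quadratic form and each uncut edge contributes zero, which is where the factor of one quarter in the cut-weight formula comes from.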


Figure 1.9: Arbitrary graph with 2-coloring showing separator and cut
edges (left).


The objective of the spectral bisection algorithm is to determine a
balanced 2-color partitioning of each connected component of the graph
such that the number of edges cut by the partition boundary (the
cut-weight) is approximately minimized. Using the cut-weight formula, we
can succinctly state the discrete bisection problem:

Discrete Bisection Problem (NP-hard)

\[ \begin{aligned}
\text{minimize} \quad & \tfrac{1}{4}\, p^T \mathcal{L} p && \text{(minimize cut-weight)} \\
\text{subject to} \quad & p^T s = 0 && \text{(balanced partitioning)} \\
& p_i = \pm 1 && \text{(discrete partitioning vector)}
\end{aligned} \tag{1.7} \]

In the spectral bisection method the discrete NP-hard problem is replaced
by a simpler continuous minimization problem. The constraint that p take
on integer values ±1 is removed and replaced with a normalization
condition on a continuous partitioning vector.

Continuous Bisection Problem ($x \in \mathbb{R}^n$)

\[ \begin{aligned}
\text{minimize} \quad & \tfrac{1}{4}\, x^T \mathcal{L} x && \text{(minimize continuous cut-weight)} \\
\text{subject to} \quad & x^T s = 0 && \text{(balanced partitioning)} \\
& x^T x = n && \text{(normalization)}
\end{aligned} \tag{1.8} \]

After solving the continuous bisection problem (exactly), the partitioning
vector p is obtained using discrete approximation:

\[ p_i = \operatorname{sign}(x_i) \quad \text{(discrete approximation)} \tag{1.9} \]

It is the replacement of the discrete partitioning vector by a continuous
counterpart followed by discrete approximation which makes the spectral
bisection algorithm approximate.

The continuous bisection problem has a well-known (exact) solution in
terms of the eigenvector associated with the first nonzero eigenvalue. To
show this consider the spectral decomposition of $\mathcal{L}$,

\[ \mathcal{L} = \sum_{i=1}^{n} \lambda_i\, y_i y_i^T, \qquad 0 \le \lambda_i \le \lambda_j, \quad i < j \tag{1.10} \]

where $\lambda_i \in \mathbb{R}$ and $y_i \in \mathbb{R}^n$ denote the
eigenvalues and orthonormal eigenvectors of $\mathcal{L}$. For ease of
exposition, assume the graph consists of a single connected component and
let $\lambda_2$ denote the first nonzero eigenvalue. The cut-weight of the
continuous problem is given by

\[ 4 W_c = x^T \mathcal{L} x = \sum_{i=2}^{n} \lambda_i\, (y_i^T x)^2 \tag{1.11} \]

Since the orthonormal eigenvectors span all of $\mathbb{R}^n$ and $y_1$ is
parallel to s, the balance constraint $x^T s = 0$ allows us to expand x
(suitably normalized) in terms of the remaining eigenvectors

\[ x = \sqrt{n} \left( \sqrt{1 - \alpha}\; y_2 + \sqrt{\alpha} \sum_{i=3}^{n} \sigma_i\, y_i \right) \tag{1.12} \]

with $0 \le \alpha \le 1$ and $\sum_{i=3}^{n} \sigma_i^2 = 1$. A direct
computation yields

\[ 4 W_c = n(1 - \alpha)\lambda_2 + n\alpha \sum_{i=3}^{n} \sigma_i^2 \lambda_i \tag{1.13} \]

which is minimized when $\alpha = 0$ so that the solution

\[ x = n^{1/2}\, y_2 \]

satisfies the continuous bisection problem with a lower bound cut-weight
estimate of

\[ W_c = \frac{n \lambda_2}{4}. \]

Figure 1.10 shows contours of the second eigenvector for a multi-element
airfoil mesh.

Figure 1.10: Contours of Fiedler Vector for Spectral Partitioning. Dashed
lines are less than the median value (right).


1.3 Graph Quadrisection and Higher Order Partitionings

One complaint commonly leveled against recursive bisection algorithms is
that they are too greedy and lack "look ahead" properties. For example, if
the goal is to construct a partitioning containing 4 subdomains then a
more optimal partitioning might be possible by considering all four
partitions simultaneously when determining the cut-weight of the graph
rather than cut-weights for pairwise bisections. This has prompted
generalizations [HL95] of the spectral bisection algorithm which require
more than one eigenvector. The spectral quadrisection algorithm by
Hendrickson and Leland [HL95] uses the first two eigenvectors, $y_2$ and
$y_3$, associated with nonzero eigenvalues. The algorithm then considers
orthogonal combinations subject to a rotation angle $\theta$

\[ \begin{aligned}
x^{(2)} &= \phantom{-}y_2 \cos\theta + y_3 \sin\theta \\
x^{(3)} &= -y_2 \sin\theta + y_3 \cos\theta.
\end{aligned} \]

Again using discrete approximation, $p^{(2)} = \operatorname{sign}(x^{(2)})$,
$p^{(3)} = \operatorname{sign}(x^{(3)})$, partitioning vectors are
calculated from which quadrants are assigned
{(+1, +1), (-1, +1), (-1, -1), (+1, -1)}. The angle $\theta$ is determined
by minimizing the distance between x and p

\[ \text{minimize} \quad \sum_{i=1}^{n} \left( (x^{(2)}_i)^2 - 1 \right)^2 + \left( (x^{(3)}_i)^2 - 1 \right)^2 \tag{1.14} \]

This has the effect of finding continuous solutions that are near the
desired discrete solution. The results shown in [HL95] are very promising
and show a definite improvement over the standard spectral bisection
algorithm (which is already considered to be quite good). The technique
extends naturally to higher order partitionings.
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Chapter  2 

A  Uniprocessor  CFD  Algorithm 


2.1  Basic  Flow  Equations 

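We consider the standard compressible Navier-Stokes equations in integral
form for a domain $\Omega$ with bounding surface $\partial\Omega$

\[ \frac{\partial}{\partial t} \int_{\Omega} u \, dV + \int_{\partial\Omega} (f \cdot n) \, dS = \int_{\partial\Omega} (g \cdot n) \, dS \tag{2.1} \]

where u represents the vector of conserved variables, f and g the inviscid
and viscous flux vectors respectively. In $\mathbb{R}^d$ the vectors are
written in Einstein summation form (j = 1, 2, ..., d):

\[ u = \begin{pmatrix} \rho \\ \rho u_i \\ E \end{pmatrix}, \qquad
f_j = \begin{pmatrix} \rho u_j \\ \rho u_i u_j + \delta_{ij}\, p \\ u_j (E + p) \end{pmatrix}, \qquad
g_j = \begin{pmatrix} 0 \\ \tau_{ij} \\ u_k \tau_{kj} - q_j \end{pmatrix} \]

with viscous stresses given by

\[ \tau_{ij} = \mu \left( \frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i} \right) + \lambda\, \frac{\partial u_k}{\partial x_k}\, \delta_{ij} \]

and Fourier heat transfer given by $q = -k \nabla T$. Finally, an ideal
gas is assumed, $p = \rho R T = (\gamma - 1)\left( E - \tfrac{1}{2}\rho (u^2 + v^2) \right)$.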


2.1.1  Surface  Boundary  Conditions 

At solid walls with no permeability and/or no slip boundary conditions the
inviscid flux reduces to the following form:

\[ f(u; n)_s = f_s \cdot n = \begin{pmatrix} 0 \\ p\, n_i \\ 0 \end{pmatrix} \]

An analysis of the viscous flux reveals the following limiting form for no
slip surfaces:

\[ g(u, \nabla u; n)_s = g_s \cdot n = \begin{pmatrix} 0 \\ \mu \nabla u_i \cdot n \\ k \nabla T \cdot n \end{pmatrix} \]

The last entry in the viscous flux vanishes for adiabatic flow. These
conditions can be enforced weakly. In addition, for viscous flow the
strong condition can be applied that the velocity vector vanish at the
surface.

2.1.2  Far  Field  Boundary  Conditions 
via  Characteristic  Projectors 

Let $A(u; n)$ denote the flux Jacobian matrix directed along the normal
vector n

\[ A(u; n) = \frac{\partial (f \cdot n)}{\partial u} \tag{2.5} \]

and define the characteristic projector matrices

\[ P^{\pm} = \tfrac{1}{2} \left[ I \pm \operatorname{sign}(A) \right]. \tag{2.6} \]

The far field inviscid flux is computed from

\[ f_\infty = f(\tilde{u}), \qquad \tilde{u} = P^{+} u + P^{-} u_\infty \tag{2.7} \]

so that the boundary condition satisfies the well-known Friedrichs [Fri58]
strong solution admissibility condition for symmetric hyperbolic problems:

\[ P^{-} \tilde{u} = P^{-} P^{+} u + P^{-} P^{-} u_\infty = P^{-} u_\infty. \tag{2.8} \]

For viscous flow, the inviscid flux (2.7) can be combined with a weak
Neumann condition for the viscous flux. Alternatively, strong Dirichlet
conditions can be imposed as dictated by the physical problem.
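The algebra behind the characteristic projectors can be illustrated numerically. The sketch below does not use the Euler flux Jacobian: it assumes a small symmetric stand-in matrix with nonzero eigenvalues so that the matrix sign function can be formed from an eigendecomposition, and the function name and the sample state vectors are hypothetical.

```python
import numpy as np


def characteristic_projectors(A):
    """Return (P_plus, P_minus) = 0.5 * (I +/- sign(A)) for symmetric A.

    sign(A) is formed as R sign(Lambda) R^T from the eigendecomposition.
    Assumes A has no zero eigenvalues.
    """
    lam, R = np.linalg.eigh(A)
    sign_A = R @ np.diag(np.sign(lam)) @ R.T
    I = np.eye(A.shape[0])
    return 0.5 * (I + sign_A), 0.5 * (I - sign_A)


# Symmetric stand-in for a flux Jacobian with eigenvalues of both signs.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, -1.0, 1.0],
              [0.0, 1.0, -3.0]])
P_plus, P_minus = characteristic_projectors(A)

u_interior = np.array([1.0, 0.3, 2.5])   # hypothetical interior state
u_infinity = np.array([1.0, 0.2, 2.4])   # hypothetical free stream state

# Boundary state built from outgoing interior and incoming far field data.
u_boundary = P_plus @ u_interior + P_minus @ u_infinity
```

Because the two projectors sum to the identity and annihilate each other, applying the incoming projector to the boundary state returns exactly the incoming part of the free stream state, which is the admissibility condition stated above.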




2.2  Turbulence 

In addition to the basic Navier-Stokes equations, we model the effects of
turbulence on the mean flow equations using an eddy viscosity turbulence
model. In a report with Baldwin [BB90], we proposed a single equation
turbulence transport model with the specific application to unstructured
meshes in mind. This model was subsequently modified by Spalart and
Allmaras [SA92] to improve the predictive capability of the model for
wakes and shear-layers as well as to simplify the model's dependence on
distance to solid walls. In the present computations, the Spalart model is
solved in a form fully coupled to the Navier-Stokes equations. The
one-equation model for the viscosity-like parameter $\tilde{\nu}$ is
written

\[ \frac{D\tilde{\nu}}{Dt} = \frac{1}{\sigma} \left[ \nabla \cdot \left( (\nu + \tilde{\nu}) \nabla \tilde{\nu} \right) + c_{b2} (\nabla \tilde{\nu})^2 \right] - c_{w1} f_w \left( \frac{\tilde{\nu}}{d} \right)^2 + c_{b1} \tilde{S} \tilde{\nu}. \tag{2.9} \]

In the Spalart model the kinematic eddy viscosity is given by

\[ \nu_t = \tilde{\nu} f_{v1} \tag{2.10} \]

and requires the following closure functions and constants

\[ \begin{aligned}
\tilde{S} &= \omega + \frac{\tilde{\nu}}{\kappa^2 d^2} f_{v2} \\
f_{v1} &= \frac{\chi^3}{\chi^3 + c_{v1}^3} \\
f_{v2} &= 1 - \frac{\chi}{1 + \chi f_{v1}} \\
f_w &= g \left[ \frac{1 + c_{w3}^6}{g^6 + c_{w3}^6} \right]^{1/6} \\
r &= \frac{\tilde{\nu}}{\tilde{S} \kappa^2 d^2} \\
g &= r + c_{w2} (r^6 - r)
\end{aligned} \]

with $\chi = \tilde{\nu}/\nu$, $\omega$ the fluid vorticity, d the distance
to the closest surface, and the constants $c_{b1} = 0.1355$,
$c_{b2} = 0.622$, $c_{v1} = 7.1$, $c_{w1} = 3.24$, $c_{w2} = 0.3$,
$c_{w3} = 2.0$, $\kappa = 0.41$, $\sigma = 2/3$. The model also includes
an optional term for simulating transition to turbulence.


2.3 The Spatial Discretization Algorithm

The flow equations are discretized in space using a finite-volume method.
In this technique the solution domain is tessellated into a number of
smaller subdomains ($\Omega = \cup\, \Omega_i$). Each subdomain serves as a
control volume in which mass, momentum, and energy are conserved. In the
present application, the control volumes are formed from a median dual
obtained from the triangulation, see Figure 2.1.

Figure 2.1: Local triangulation with Dirichlet and median duals.

Fundamental to the finite-volume method is the definition of the integral
cell average. Componentwise, the integral cell average is defined in each
subdomain as:

\[ \bar{u}_i = \frac{1}{V_i} \int_{\Omega_i} u \, dV \]

where $V_i = \int_{\Omega_i} dV$. The integral conservation law can then
be rewritten in the following form:

\[ \frac{\partial}{\partial t} (\bar{u} V)_i + \int_{\partial\Omega_i} (f \cdot n) \, dS = \int_{\partial\Omega_i} (g \cdot n) \, dS. \tag{2.11} \]

The  integral  cell  averages  are  the  basic  un¬ 
knowns  (degrees  of  freedom)  in  the  scheme.  The 
task  at  hand  is  to  evaluate  the  flux  integral 
given  these  cell  averages  of  the  solution.  The 
basic  solution  process  is  summarized  in  the  fol¬ 
lowing  steps  and  further  details  are  given  in 
[Bar91]  [BJ89]: 
Reconstruction

Given integral averages of the solution in each control volume,
reconstruct a piecewise polynomial which approximates the behavior of the
solution in each control volume.

Flux Quadrature

From the piecewise polynomial description of the solution, approximate the
flux integral in (2.11) by numerical quadrature. Because the piecewise
polynomials are not continuous at control volume boundaries, special flux
functions are employed which are functions of multiple solution states.
Those flux functions which can be characterized as some approximate and/or
exact solution of the Riemann problem of gasdynamics result in upwind
biased approximations. Present computations utilize Roe's approximate
Riemann solver [Roe81].

Evolution

Given a numerical approximation to the flux integral, evolve the system in
time using any class of implicit or explicit schemes. This results in new
integral cell averages of the solution. The solution process can then be
repeated.

It is important to realize that for steady-state calculations, the spatial
accuracy of the scheme depends solely on the reconstruction and flux
quadrature steps. Moreover, the use of cell averages can be replaced by
pointwise values of the solution associated with each control volume. In
our application, we place the solution unknowns at mesh vertices. As we
will see, this can greatly simplify the reconstruction step.
Unfortunately, schemes based on these reconstructed polynomials are
subject to the generation of spurious oscillations near discontinuities
and regions of high solution gradient unless additional measures are taken
which limit extremum behavior of the reconstructed polynomial. These
measures are the basis for the class of MUSCL schemes developed by van
Leer [vL79]. This framework of reconstruction followed by monotonicity
enforcement extends naturally to unstructured meshes in higher dimensions,
and sufficient conditions required of the reconstructed polynomial to
guarantee monotonicity are generally known, see for example [Bar94].


2.3.1  Linear  Least-Squares  Recon¬ 
struction 

Consider a vertex $v_0$ and suppose that the solution varies linearly over
the support of adjacent neighbors of the mesh. In this case, the change in
vertex values of the solution along an edge $e(v_i, v_0)$ can be
calculated by

\[ (\nabla u)_0 \cdot (r_i - r_0) = u_i - u_0 \]

where r denotes the spatial position vector. This equation represents the
scaled projection of the gradient along the edge $e(v_i, v_0)$. A similar
equation could be written for all incident edges subject to an arbitrary
weighting factor. The result is the following matrix equation, shown here
in two dimensions:

\[ \begin{bmatrix} w_1 \Delta x_1 & w_1 \Delta y_1 \\ \vdots & \vdots \\ w_n \Delta x_n & w_n \Delta y_n \end{bmatrix}
\begin{pmatrix} u_x \\ u_y \end{pmatrix} =
\begin{pmatrix} w_1 (u_1 - u_0) \\ \vdots \\ w_n (u_n - u_0) \end{pmatrix} \]

or in symbolic form $\mathcal{L}\, \nabla u = f$ where

\[ \mathcal{L} = [\, L_1 \;\; L_2 \,] \]

in two dimensions. Exact calculation of gradients for linearly varying u
is guaranteed if any two row vectors $w_i (r_i - r_0)$ span all of 2
space. This implies linear independence of $L_1$ and $L_2$. The system can
then be solved via normal equations

\[ \begin{bmatrix} V_1 \\ V_2 \end{bmatrix} [\, L_1 \;\; L_2 \,] = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \]

The row vectors $V_i$ are given by

\[ V_1 = \frac{l_{22} L_1^T - l_{12} L_2^T}{l_{11} l_{22} - l_{12}^2}, \qquad
V_2 = \frac{l_{11} L_2^T - l_{12} L_1^T}{l_{11} l_{22} - l_{12}^2} \tag{2.12} \]

with $l_{ij} = (L_i \cdot L_j)$.
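The normal-equation solution (2.12) can be accumulated edge by edge, which is how the least-squares gradient is typically evaluated on an edge data structure. The following is a minimal two-dimensional sketch under stated assumptions: the function name `lsq_gradients`, the array-based mesh description and the small test mesh are hypothetical, and one symmetric weight per edge is assumed.

```python
import numpy as np


def lsq_gradients(x, y, u, edges, w=None):
    """Weighted least-squares gradients (u_x, u_y) at every mesh vertex.

    x, y, u: arrays of vertex coordinates and vertex values.
    edges: list of (origin, destination) vertex index pairs.
    w: optional array with one weight per edge (defaults to unit weights).
    """
    nv = len(x)
    w = np.ones(len(edges)) if w is None else np.asarray(w, dtype=float)
    l11, l12, l22, f1, f2 = (np.zeros(nv) for _ in range(5))
    for k, (j1, j2) in enumerate(edges):
        dx = w[k] * (x[j2] - x[j1])
        dy = w[k] * (y[j2] - y[j1])
        du = w[k] * (u[j2] - u[j1])
        # The products are unchanged when the edge is viewed from either
        # end, so the same increments are added to both end points.
        for j in (j1, j2):
            l11[j] += dx * dx
            l12[j] += dx * dy
            l22[j] += dy * dy
            f1[j] += dx * du
            f2[j] += dy * du
    det = l11 * l22 - l12 ** 2
    ux = (l22 * f1 - l12 * f2) / det
    uy = (l11 * f2 - l12 * f1) / det
    return ux, uy


# Unit square with a center vertex, triangulated into four triangles.
x = np.array([0.0, 1.0, 1.0, 0.0, 0.5])
y = np.array([0.0, 0.0, 1.0, 1.0, 0.5])
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4), (1, 4), (2, 4), (3, 4)]
u = 2.0 * x - 3.0 * y + 1.0   # linear field with gradient (2, -3)
ux, uy = lsq_gradients(x, y, u, edges)
```

For the linear field used here the reconstructed gradient equals (2, -3) at every vertex up to rounding, illustrating that linear functions are reconstructed exactly whenever the incident edges at a vertex span the plane.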

Note that reconstruction of N independent variables in $\mathbb{R}^d$
implies $\binom{d+1}{2} + dN$ inner product sums. Since only dN of these
sums involve the solution variables themselves, the remaining sums could
be precalculated and stored in computer memory. Using the edge data
structure, the inner product sums can be calculated for arbitrary
combinations of polyhedral cells. In all cases linear functions are
reconstructed exactly.

Algorithm: Weighted Least-Squares Gradient Calculation

For k = 1, n(e)                              Loop through edges
    j1 = e^{-1}(k,1)                         Edge origin
    j2 = e^{-1}(k,2)                         Edge destination
    Δx = w(k) · (x(j2) - x(j1))              Weighted Δx
    Δy = w(k) · (y(j2) - y(j1))              Weighted Δy
    l11(j1) = l11(j1) + Δx · Δx              l11 orig sum
    l11(j2) = l11(j2) + Δx · Δx              l11 dest sum
    l12(j1) = l12(j1) + Δx · Δy              l12 orig sum
    l12(j2) = l12(j2) + Δx · Δy              l12 dest sum
    l22(j1) = l22(j1) + Δy · Δy              l22 orig sum
    l22(j2) = l22(j2) + Δy · Δy              l22 dest sum
    Δu = w(k) · (u(j2) - u(j1))              Weighted Δu
    f1(j1) += Δx · Δu                        L1 f sum
    f1(j2) += Δx · Δu
    f2(j1) += Δy · Δu                        L2 f sum
    f2(j2) += Δy · Δu
Endfor

For j = 1, n(v)                              Process vertices
    det = l11(j) · l22(j) - l12(j)^2
    ux(j) = (l22(j) · f1(j) - l12(j) · f2(j)) / det
    uy(j) = (l11(j) · f2(j) - l12(j) · f1(j)) / det
Endfor

This formulation provides freedom in the choice of weighting coefficients,
$w_i$. These weighting coefficients can be a function of the geometry
and/or solution. Classical approximations in one dimension can be
recovered by choosing geometrical weights of the form
$w_i = 1/|\Delta r_i - \Delta r_0|^t$ for values of t = 0, 1, 2. Data
dependent choices are discussed in [Bar94].


2.4 Exact and Approximate Newton Methods

In this section we consider implicit solution strategies for the upwind
discretization scheme described in the previous section. Define the
solution vector

\[ U = [u_1, u_2, u_3, \dots, u_N]^T \]

and similarly the residual vector R for all mesh vertices. The basic
scheme is written as

\[ D\, U_t = R(U). \tag{2.13} \]

Applying backward Euler time integration, equation (2.13) is rewritten as

\[ D\, (U^{n+1} - U^n) = \Delta t\, R(U^{n+1}) \]

where n denotes the iteration (time step) level. Linearizing the
right-hand-side of the preceding equation in time produces the following
form:

\[ D\, (U^{n+1} - U^n) = \Delta t \left[ R(U^n) + \frac{\partial R}{\partial U} (U^{n+1} - U^n) \right]. \tag{2.14} \]

By rearrangement of terms, we arrive at the delta form of the backward
Euler scheme

\[ \left[ \frac{D}{\Delta t} - \frac{\partial R}{\partial U} \right] (U^{n+1} - U^n) = R(U^n). \tag{2.15} \]

Note that for large time steps, the scheme becomes equivalent to Newton's
method. In practice the diagonal entries are locally scaled as an
exponential function of the norm of the residual,

\[ \mathrm{CFL}_{max} = f(\| R(U^n) \|), \]

so that when $\|R(U)\| \rightarrow 0$, $\mathrm{CFL}_{max} \rightarrow \infty$
and the scheme approaches Newton's method. It should be emphasized that by
using this strategy, the scheme is technically an approximate Newton
method which becomes exact only in the final few iterations of the
computation.

The following two sections present examples which demonstrate the
convergence characteristics of Newton's method for inviscid and viscous
fluid flow problems. In viewing these examples, the reader can assume that
each matrix problem required in the Newton scheme is solved "exactly." In
reality, these matrix problems are solved iteratively to a user specified
tolerance. The topic of solving the linear algebra problem will be
discussed in detail in later sections. The test case examples are
presented at this time so that they may be used in the remainder of these
notes for comparison purposes.
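The way the delta form of the backward Euler scheme turns into Newton's method as the time step grows can be demonstrated on a small model problem. Everything in the sketch below is an assumption made for illustration: the two-equation residual, the identity matrix used for D, and in particular the rule that grows the time step in inverse proportion to the residual norm, which stands in for the residual-dependent scaling mentioned in the text and is not the scaling function used in the flow solver.

```python
import numpy as np


def residual(U):
    """Model residual R(U) with steady state U = (1, 1)."""
    return np.array([1.0 - U[0] ** 2, U[0] - U[1]])


def jacobian(U):
    """Exact Jacobian dR/dU of the model residual."""
    return np.array([[-2.0 * U[0], 0.0],
                     [1.0, -1.0]])


def approximate_newton(U0, dt0=1.0, tol=1e-12, max_iter=50):
    """Delta-form backward Euler: (D/dt - dR/dU) dU = R(U^n)."""
    U = np.array(U0, dtype=float)
    D = np.eye(len(U))
    norm0 = np.linalg.norm(residual(U))
    history = []
    for _ in range(max_iter):
        R = residual(U)
        norm = np.linalg.norm(R)
        history.append(norm)
        if norm < tol:
            break
        # Assumed growth rule: time step inversely proportional to the
        # residual norm, so D/dt vanishes as the iteration converges.
        dt = dt0 * norm0 / norm
        dU = np.linalg.solve(D / dt - jacobian(U), R)
        U = U + dU
    return U, history


U_steady, history = approximate_newton([0.2, 0.0])
```

Early iterations behave like damped implicit time steps, while the last few iterations are effectively Newton steps, so the residual norms stored in `history` fall slowly at first and then very rapidly.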
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2.4.1  Test  Case  1:  Inviscid  Flow  Past 
a  Multi-Element  Airfoil 


As  a  first  test  case,  inviscid  Euler  flow  is  com¬ 
puted  about  a  multi-element  airfoil  geometry  as 
shown  in  Figure  2.2. 


tices. 


Figure 2.3: Solution isomach contours about multi-element geometry,
$M_\infty = 0.2$, $\alpha = 2.0°$.

The mesh contains approximately 4900 mesh vertices. Subsonic flow
conditions are imposed ($M_\infty = 0.2$) with a 2° free stream angle of
attack.
Figures  2.3  -  2.5  show  Mach  number  contours, 
surface  pressure  coefficient,  and  convergence  his¬ 
tory  for  the  calculation.  An  initial  time  step  was 
chosen  for  the  calculation  which  corresponds  to 
an  effective  local  CFL  number  of  approximately 
50,  but  over  the  next  10  iterations  the  effective 
CFL  number  quickly  reaches  10®.  This  test  case 
will  be  used  extensively  in  Chapter  3  when  eval¬ 
uating  parallel  solution  strategies. 


Figure 2.4: Surface pressure coefficient computed from multi-element geometry (horizontal axis: x/c).

Figure 2.5: Numerical solution convergence history.

2.4.2 Test Case 2: Viscous Flow Past a Multi-Element Airfoil


Figure 2.6: Multi-element airfoil triangulation, 22,000 vertices.

Figure 2.7: Multi-element airfoil solution isomach contours, $M_\infty = 0.2$, $\alpha = 16.0^\circ$, Re = 9.0 million.

As a second test case, viscous flow with turbulence is computed about the multiple-element airfoil geometry. This geometry has been triangulated using the Steiner triangulation algorithm described in [Bar95], see Figure 2.6. The mesh contains approximately 22,000 vertices with cells near the airfoil surface attaining aspect ratios greater than 1000:1. This example provides a demanding test case for CFD algorithms. The experimental flow conditions are $M_\infty = 0.20$, $\alpha = 16^\circ$, and a Reynolds number of 9 million. Experimental results are given in [VDMG92] and computed results are shown in Figure 2.7. Even though the wake passing over the main element is not well resolved, the surface pressure coefficient shown in Figure 2.8 agrees quite well with experiment. The convergence history in Figure 2.9 shows that roughly twice as many iteration steps are needed for the viscous turbulent flow calculation when compared to the inviscid flow computation of Test Case 1. This seems to be typical for aerodynamic high lift computations. This is contrasted by single element airfoil computations which show very little difference in the number of iterations needed when computing inviscid and viscous flow. This test case will also be used extensively in subsequent chapters for evaluating various solution strategies.

Figure 2.8: Comparison of computational and experimental surface pressure coefficients.

Figure 2.9: Solution convergence history for Case 2 computation.

2.4.3  Storage  Requirements 

It is worthwhile to assess the computer memory requirements for storing sparse matrices obtained from discretizations on simplicial meshes (triangulations). In practice we will be solving systems of $I$ coupled equations so that each nonzero entry of the matrix is actually a small $I \times I$ block. The schemes discussed in previous sections require data from distance-one neighbors in the graph (mesh). In addition, the higher order accurate schemes require distance-two neighbors in building the scheme. First consider the situation in which the scheme requires only distance-one neighbors. The number of nonzero entries in each row of the matrix is related to the number of edges incident to the vertex associated with that row. Or equivalently, each edge $e(v_i, v_j)$ will guarantee nonzero entries in the i-th column and j-th row and similarly the j-th column and i-th row. In addition nonzero entries will be placed on the diagonal of the matrix. From this counting argument we see that the number of nonzero block entries, nnz, in the matrix is exactly twice the number of edges plus the number of vertices, $2E + N$ (approximately $7N$ in 2-d). Using a similar counting argument we obtain the following approximate requirements for storing distance-one and distance-two neighboring information as a sparse matrix:

Table 2.1: Storage Estimates for Sparse Matrices

Dim.    nnz (Distance-1)    nnz (Distance-2)
2       7N                  19N
3       14N                 55N

Note that the entries of the sparse matrix associated with Newton's method are actually small 5 x 5 and 6 x 6 blocks in two and three dimensions respectively. At first glance, this storage requirement appears prohibitively large. While this may be true to some extent today, the memory capacity of computers is expanding at a rapid rate. It is quite reasonable to expect that in the foreseeable future sufficient memory will be available for solving most problems of engineering interest. Even so, it is possible to reduce, and in some cases eliminate, the explicit storage of the Jacobian matrix without compromising the favorable convergence characteristics of Newton's method. These techniques will be discussed in subsequent sections.
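The counting argument can be checked on a simple mesh. The sketch below is only an illustration under stated assumptions: the mesh is a hypothetical structured grid of n x n vertices with every cell split by one diagonal (not one of the airfoil meshes), the function names are invented here, and 8-byte reals are assumed for the memory estimate.

```python
def grid_triangulation_counts(n):
    """Vertex and edge counts for an n x n vertex grid, each cell split by one diagonal."""
    vertices = n * n
    # horizontal edges + vertical edges + one diagonal per cell
    edges = 2 * n * (n - 1) + (n - 1) ** 2
    return vertices, edges


def block_nonzeros(vertices, edges):
    """Distance-1 pattern: two off-diagonal blocks per edge plus one diagonal block per vertex."""
    return 2 * edges + vertices


def storage_bytes(nnz_blocks, block_size, bytes_per_value=8):
    """Memory for the block values only (index arrays are ignored in this estimate)."""
    return nnz_blocks * block_size * block_size * bytes_per_value


N, E = grid_triangulation_counts(100)
nnz = block_nonzeros(N, E)
ratio = nnz / N                                       # close to 7 in 2-d
megabytes = storage_bytes(nnz, block_size=5) / 1.0e6  # 5 x 5 blocks assumed
print(N, E, nnz, round(ratio, 2), round(megabytes, 1))
```

For this grid the ratio nnz/N comes out just under 7, in line with the 2-d estimate in Table 2.1; boundary vertices account for the small shortfall.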

2.4.4  Calculating  Analytic  Jacobian 
Derivatives 

In  this  section  we  address  the  task  of  comput¬ 
ing  Jacobian  derivatives  for  Newton’s  method. 
In  the  following  section  we  consider  the  related 
task  of  multiplying  an  arbitrary  vector  by  the 
Jacobian  matrix. 

A major task in the overall calculation of the Jacobian derivatives for the finite-volume discretization is the linearization of the numerical flux vector with respect to the two solution states, e.g. given the Roe flux function [Roe81]

$$h(u^+, u^-; n) = \frac{1}{2}\left(f(u^+, n) + f(u^-, n)\right) \qquad (2.16)$$

$$\qquad\qquad\qquad - \frac{1}{2}\,|A(u^+, u^-; n)|\,(u^- - u^+) \qquad (2.17)$$

we require the Jacobian terms $\partial h/\partial u^+$ and $\partial h/\partial u^-$. Exact analytical expressions for these terms are available [Bar87]. In constructing the Jacobian matrix for the entire scheme it is useful to conceptualize the finite-volume scheme in composition form:

$$R(U) = \mathcal{L}_1(\mathcal{L}_2(U)) \qquad (2.18)$$

with $\mathcal{L}_1$ representing the flux quadrature and accumulation step and $\mathcal{L}_2$ representing the data reconstruction step. In this form, each operator requires distance-1 information. The Jacobian matrix can then be written as

$$\frac{dR}{dU} = \frac{d\mathcal{L}_1}{d\mathcal{L}_2}\,\frac{d\mathcal{L}_2}{dU} \qquad (2.19)$$

with the critical observation that the Jacobian matrix can be calculated as the sparse product of two matrices. This could potentially be an expensive task, but because of the special form of $\mathcal{L}_1$ and $\mathcal{L}_2$, the resulting sparse product produces at most distance-2 fill and can be computed at reasonable cost.

2.4.5 Exact and Approximate Jacobian Matrix-Vector Products

Consider the standard matrix equation $Ax - b = 0$. As we will see, iterative matrix solution algorithms for this problem such as the generalized minimum residual method (GMRES) and the stabilized bi-conjugate gradient method (Bi-CGSTAB) both require the computation of matrix-vector products of the form $Ap$ for some arbitrary vector $p$. In the approximate Newton algorithm

$$A \approx \left[\frac{dR}{dU}\right] \qquad (2.20)$$

so that a major computation in the matrix-vector product $Ap$ is the computation of Jacobian derivatives in the direction of $p$ (a Frechet derivative)

$$Ap = \frac{dR}{dU}\,p. \qquad (2.21)$$

Several possible strategies exist for computing the needed Frechet derivatives:

Sparse Matrix-Vector Multiply

The most straightforward strategy is to analytically compute and store the Jacobian matrix using a compressed storage scheme designed for sparse matrices. This strategy has the added benefit that a copy of the matrix can also be used as a preconditioner for the iterative solver. In addition, the explicit storage also permits the formation of the transposed matrix problem which is often encountered in optimization procedures coupled with Newton's method. Obviously, a detraction of this approach is the large storage requirement.
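The stored-matrix strategy, in which the Jacobian is kept in a compressed sparse format, can be sketched as follows. This is a minimal illustration and not the notes' implementation: compressed sparse row (CSR) storage is assumed as the "compressed storage scheme", scalar entries stand in for the small blocks, and all function names are hypothetical. The point of the sketch is that the same stored entries serve both $Ap$ and the transposed product $A^T p$.

```python
def dense_to_csr(rows):
    """Convert a dense list-of-lists matrix into CSR arrays (values, column indices, row pointers)."""
    values, col_index, row_ptr = [], [], [0]
    for row in rows:
        for j, a in enumerate(row):
            if a != 0.0:
                values.append(a)
                col_index.append(j)
        row_ptr.append(len(values))
    return values, col_index, row_ptr


def csr_matvec(values, col_index, row_ptr, p):
    """y = A p, gathering entries of p one row at a time."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * p[col_index[k]]
    return y


def csr_transpose_matvec(values, col_index, row_ptr, p, ncols):
    """y = A^T p from the same stored entries (a scatter instead of a gather)."""
    y = [0.0] * ncols
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_index[k]] += values[k] * p[i]
    return y


# Small nonsymmetric example matrix (hypothetical values)
A = [[4.0, -1.0, 0.0],
     [-2.0, 4.0, -1.0],
     [0.0, -2.0, 4.0]]
csr = dense_to_csr(A)
p = [1.0, 2.0, 3.0]
Ap = csr_matvec(*csr, p)
ATp = csr_transpose_matvec(*csr, p, ncols=3)
```

No second copy of the matrix is needed for the transposed product, which is the property the text relies on when it mentions the transposed matrix problem.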


Approximate  Frechet  Derivatives 

An alternative to the analytic calculation of Frechet derivatives is to approximate them using a finite-difference approximation [Joh92] [BS94] [EW94]. The required Frechet derivative is a limiting form of the difference approximation

$$\frac{dR}{dU}\,p = \lim_{\epsilon \to 0} \frac{R(U + \epsilon p) - R(U)}{\epsilon}$$

The primary concern with this approach is the accuracy of derivatives and the optimal choice for $\epsilon$. If derivatives are not computed accurately then methods such as GMRES or Bi-CGSTAB iteration may stall or fail. Using a forward difference approximation, $\epsilon$ must be carefully chosen. In general it is insufficient to choose $\epsilon$ as a constant such as the square root of machine precision. Johan [Joh92] also mentions this fact and gives some analysis for choosing $\epsilon$ but this analysis assumes that $R(U)$ is well scaled. A common choice for $\epsilon$ is given by

$$\epsilon = \frac{\epsilon_0 + \epsilon_1 \|U\|}{\|p\|} \qquad (2.22)$$

with suitably chosen constants $\epsilon_0$ and $\epsilon_1$. An alternative is to use higher order accurate formulas such as central differencing at double the computational cost.
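A forward-difference Frechet derivative with a scaled step can be sketched as below. Everything specific here is an assumption made for illustration: the residual is a hypothetical algebraic function (not a flow residual), the step is scaled by the norms of $U$ and $p$, and the default constants `eps0` and `eps1` are placeholders rather than recommended values.

```python
import math


def residual(U):
    """Hypothetical nonlinear residual, R_i = U_i^2 - U_{i+1} with periodic indexing."""
    n = len(U)
    return [U[i] ** 2 - U[(i + 1) % n] for i in range(n)]


def analytic_jacobian_vector(U, p):
    """Exact directional derivative of the residual above, used only as a reference."""
    n = len(U)
    return [2.0 * U[i] * p[i] - p[(i + 1) % n] for i in range(n)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def frechet_forward_difference(R, U, p, eps0=1.0e-8, eps1=1.0e-8):
    """Approximate (dR/dU) p by one extra residual evaluation."""
    eps = (eps0 + eps1 * norm(U)) / norm(p)   # assumed scaling of the step
    R0 = R(U)
    R1 = R([u + eps * q for u, q in zip(U, p)])
    return [(a - b) / eps for a, b in zip(R1, R0)]


U = [1.0, 2.0, 3.0, 4.0]
p = [0.5, -1.0, 2.0, 0.25]
approx = frechet_forward_difference(residual, U, p)
exact = analytic_jacobian_vector(U, p)
max_error = max(abs(a - b) for a, b in zip(approx, exact))
```

The product costs one additional residual evaluation and no matrix storage, but its accuracy is limited by the competing truncation and round-off errors, which is why the step size needs care.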

The clear attraction of this approach is the low memory requirement. On the other hand, the numerical computation of Frechet derivatives does not produce a matrix approximation which can be used to precondition the system. Lastly, for situations requiring the solution of the transposed matrix problem, there does not appear to be a Frechet-like technique for constructing the matrix-vector product

$$\left[\frac{dR}{dU}\right]^T p$$

using numerical difference approximations. We consider this a serious shortcoming of the method.

Exact Product Forms

In this section we will present a technique for constructing matrix-vector products which is an exact calculation of the Frechet derivative. Extension to systems and the inclusion of diffusion terms are also handled using this technique.

Let $G(E, V)$ denote the triangulation in 2-d or 3-d with $n$ vertices and $m$ edges. Next recall the definition of the incidence matrix given in Equation (1.2):

$$C_{il} = \begin{cases} -1 & \text{if } v_i \text{ is the origin of edge } l \\ \phantom{-}1 & \text{if } v_i \text{ is the destination of edge } l \\ \phantom{-}0 & \text{otherwise} \end{cases} \qquad (2.23)$$

Let $h = h(u^+, u^-; n)$ denote the numerical flux function as defined by Equation 2.17. For a system of $I$ coupled differential equations, the Jacobian matrix entries are actually small $I \times I$ blocks. For ease of exposition, we tacitly treat these small blocks as scalar entries. Under these simplifications, the desired matrix-vector product is given by

$$\frac{dR}{dU}\,p = C\left[\frac{\partial h}{\partial u^+}\right]\frac{\partial u^+}{\partial u}\,p + C\left[\frac{\partial h}{\partial u^-}\right]\frac{\partial u^-}{\partial u}\,p \qquad (2.24)$$

where $[\partial h/\partial u^{\pm}] \in \mathbb{R}^{m \times m}$ with nonzero diagonal elements, and $\partial u^{\pm}/\partial u \in \mathbb{R}^{m \times n}$. If we do not incorporate monotonicity enforcement into the reconstruction procedure then a considerable simplification occurs in the calculation of matrix-vector products. The main idea is given in the following almost trivial lemma.

Lemma: Let $v = \mathcal{R}(u) = \mathcal{R}(u_1, u_2, \ldots, u_n)$ denote an arbitrary order reconstruction operator. If $\mathcal{R}$ depends linearly on $u_i$ then

$$\frac{dv}{du}\,p = \mathcal{R}(p).$$

Proof: Linearity implies that

$$v = \mathcal{R}(u_1, u_2, \ldots, u_n) = \sum_{i=1}^{n} \alpha_i u_i$$

so that $\partial v/\partial u_i = \alpha_i$. The desired result follows immediately

$$\frac{dv}{du}\,p = \sum_{i=1}^{n} \frac{\partial v}{\partial u_i}\,p_i = \sum_{i=1}^{n} \alpha_i p_i = \mathcal{R}(p).$$


This  lemma  suggests  the  following  procedure  for 
calculation  of  matrix- vector  products. 


(2.25) 


This  amounts  to  a  reconstruction  of  the  vec¬ 
tors  and  from  p  using  the  same  recon¬ 
struction  operator  used  in  the  residual  computa¬ 
tion.  Next,  the  linearized  form  of  the  flux  func¬ 
tion  is  computed: 


_ih_  r. 


du^ 


Finally,  the  linearized  fluxes  are  assembled  using 
the  same  procedure  as  the  residual  vector  assem¬ 
bly.  In  actual  calculations,  the  conservative  flow 
variables  are  not  reconstructed,  thereby  necessi¬ 
tating  that  a  change  of  variable  transformation 
be  embedded  in  the  formulation.  This  is  not  a 
serious  complication. 
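The three steps just described (reconstruct from $p$ with the same operator, apply the linearized flux, assemble like the residual) can be sketched on a model problem. The setting is an assumption chosen to keep the example short: a periodic 1-D mesh, a scalar Burgers flux $f(u) = u^2/2$ with a constant dissipation coefficient in place of the Roe flux, and an unlimited linear reconstruction. All names are hypothetical. The exact product is compared with a central difference of the residual, which is exact up to round-off here because the model residual is quadratic.

```python
LAMBDA = 2.0  # assumed constant dissipation coefficient of the model flux


def reconstruct(u):
    """Linear (unlimited) reconstruction; edge e joins vertex e (origin) and e+1 (destination)."""
    n = len(u)
    left, right = [], []
    for e in range(n):
        i, j = e, (e + 1) % n
        slope_i = 0.5 * (u[(i + 1) % n] - u[i - 1])
        slope_j = 0.5 * (u[(j + 1) % n] - u[j - 1])
        left.append(u[i] + 0.5 * slope_i)
        right.append(u[j] - 0.5 * slope_j)
    return left, right


def flux(ul, ur):
    return 0.5 * (0.5 * ul * ul + 0.5 * ur * ur) - 0.5 * LAMBDA * (ur - ul)


def flux_derivatives(ul, ur):
    """Partial derivatives of the model flux with respect to the left and right states."""
    return 0.5 * ul + 0.5 * LAMBDA, 0.5 * ur - 0.5 * LAMBDA


def assemble(edge_values):
    """Scatter edge quantities to vertices: subtract at the origin, add at the destination."""
    n = len(edge_values)
    r = [0.0] * n
    for e, h in enumerate(edge_values):
        r[e] -= h
        r[(e + 1) % n] += h
    return r


def residual(u):
    left, right = reconstruct(u)
    return assemble([flux(a, b) for a, b in zip(left, right)])


def jacobian_vector(u, p):
    """Exact (dR/dU) p without forming the matrix."""
    ul, ur = reconstruct(u)
    pl, pr = reconstruct(p)          # same reconstruction operator, applied to p
    dh = []
    for a, b, x, y in zip(ul, ur, pl, pr):
        da, db = flux_derivatives(a, b)
        dh.append(da * x + db * y)   # linearized flux on each edge
    return assemble(dh)              # same assembly as the residual


u = [1.0, 1.5, 0.5, 2.0, 1.2, 0.8]
p = [0.3, -0.2, 0.7, 0.1, -0.5, 0.4]
eps = 1.0e-4
plus = residual([a + eps * b for a, b in zip(u, p)])
minus = residual([a - eps * b for a, b in zip(u, p)])
central = [(a - b) / (2.0 * eps) for a, b in zip(plus, minus)]
exact = jacobian_vector(u, p)
max_diff = max(abs(a - b) for a, b in zip(central, exact))
```

The reconstruction routine is reused unchanged for $p$, which is exactly what the lemma permits when the reconstruction is linear.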

Equation (2.25) does not reveal how to construct the transposed matrix-vector product

$$\left[\frac{dR}{dU}\right]^T p.$$

But by introducing additional matrices, we can derive the required equation. In addition, the following forms allow the incorporation of monotonicity limiting in the reconstruction process, although we have not done so here. Define $\mathcal{A}, S^+, S^- \in \mathbb{R}^{n \times m}$. If $e(v_i, v_j) \in G(E, V)$, then

$$\mathcal{A}_{ie} = S^+_{ie} = 1, \qquad \mathcal{A}_{je} = S^-_{je} = 1$$

and zero otherwise. In addition, define the diagonal $m \times m$ matrices containing weighted edge geometry $[\Delta x]$ and $[\Delta y]$ as well as the $n \times n$ diagonal matrices $[D_{xx}]$, $[D_{xy}]$, $[D_{yx}]$ and $[D_{yy}]$ containing pointwise determinant values for the least squares solution. Using these matrices the left and right reconstructed states obtained by least squares reconstruction are given by


From these formulas the transposed matrix problem is easily calculated


=[i-i 

+  C  [Ai]  [£>„]  -t-  [Aj,]  [!>.,]  [S-]  [^] 

+  C  [Ax]A^[Dy.]+[Ay]A^[Dyy]  [5”] 

-  C  [Ax]A'^[D,.]  +  [Ay]A^[D,y]  [5+]  [^] 

-  e  [Ax]A^[Dy.]  +  [Ay]A^[Dyy]  [5+]  [^] 

[_^1^  r 

(2.26)

Just as equations (2.26) have an implementation using an edge data structure (one would never store the connectivity as $\mathcal{A}$ or $C$ in dense matrix form), the transposed equation has an implementation using an edge data structure for the calculation of $\left[\frac{dR}{dU}\right]^T p$. For example, the matrix operation $\mathcal{A}^T v$ performs a gather and sum of the two edge vertex values of $v$ for each and every edge. The matrix operation $\mathcal{A} w$ performs a scatter and accumulate of an edge quantity $w$ to the two edge vertex locations for each and every edge. Similar edge operations exist for the incidence matrix $C$. Thus we have constructed a technique for matrix-vector products based on elementary edge operations which also permits constructing the transposed matrix-vector product. The ability to write the entire algorithm in terms of a sequence of edge operations makes the parallel implementation straightforward.
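The gather and scatter operations can be written directly as loops over an edge list. The sketch below assumes a small hypothetical mesh stored as (origin, destination) pairs; the function names are invented here. It also checks numerically that each scatter is the transpose of the corresponding gather, by comparing the two inner products $(\mathcal{A} w) \cdot v$ and $w \cdot (\mathcal{A}^T v)$, and likewise for the incidence matrix.

```python
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]  # hypothetical (origin, destination) pairs
n_vertices = 4


def gather_sum(v):
    """A^T v: for every edge, sum the values at its two vertices."""
    return [v[i] + v[j] for i, j in edges]


def scatter_accumulate(w):
    """A w: for every edge, add the edge value to both of its vertices."""
    out = [0.0] * n_vertices
    for (i, j), w_e in zip(edges, w):
        out[i] += w_e
        out[j] += w_e
    return out


def incidence_apply(w):
    """C w: subtract the edge value at the origin, add it at the destination."""
    out = [0.0] * n_vertices
    for (i, j), w_e in zip(edges, w):
        out[i] -= w_e
        out[j] += w_e
    return out


def incidence_transpose_apply(v):
    """C^T v: destination value minus origin value on every edge."""
    return [v[j] - v[i] for i, j in edges]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


v = [1.0, 2.0, 3.0, 4.0]   # vertex quantity
w = [0.5, -1.0, 2.0, 1.5]  # edge quantity
lhs_A = dot(scatter_accumulate(w), v)
rhs_A = dot(w, gather_sum(v))
lhs_C = dot(incidence_apply(w), v)
rhs_C = dot(w, incidence_transpose_apply(v))
```

Because every operation is a loop over edges, a partition of the edge list across processors gives a natural parallel decomposition, with only the vertex accumulations at partition boundaries requiring communication.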

2.4.6  Solving  the  Matrix  Problem 

The  next  task  is  to  solve  the  large  sparse  linear 
system  of  the  form 

$$Ap - b = 0$$

produced  by  Newton’s  method.  Owing  to  the 
nonsymmetry  of  A,  we  consider  solving  this 
system  using  the  generalized  minimum  residual 
method  (GMRES)  of  Saad  and  Schultz  [SS86] 
and  the  stabilized  bi-conjugate  gradient  method 
of  Van  der  Vorst  [vdV92].  Both  algorithms  are 
outlined  in  Table  2.2.  The  paragraphs  given  be¬ 
low  briefly  describe  the  methods  but  for  a  full 
description  we  defer  to  the  lectures  of  Prof.  Van 
der  Vorst. 

The  GMRES  Algorithm 

The GMRES algorithm explicitly forms an orthogonal basis $[v_1, v_2, \ldots, v_k]$ from the Krylov sequence $[r_0, A r_0, A^2 r_0, \ldots, A^{k-1} r_0]$ using a modified Gram-Schmidt orthogonalization procedure. Using this orthogonal basis, GMRES iterates are computed

$$p_k = p_0 + \alpha_1 v_1 + \alpha_2 v_2 + \ldots + \alpha_k v_k \qquad (2.27)$$

by minimizing the residual norm

$$\|A p_k - b\|. \qquad (2.28)$$

The algorithm requires $k + 1$ vector inner products, $k + 1$ SAXPY operations, and one matrix-vector multiply for iteration $k$. Thus as $k$ increases, the storage increases linearly and the computation quadratically. To avoid the storage and computation demands imposed by large matrices, Saad and Schultz devised a variant, GMRES($k$), in which the GMRES algorithm is restarted every $k$ steps. The proper choice of $k$ is problem dependent and a strong function of the choice and quality of preconditioning.

The  Bi-CGSTAB  Algorithm 

The stabilized bi-conjugate gradient method (Bi-CGSTAB) is a short recurrence method designed for nonsymmetric matrices. Roughly speaking, Bi-CGSTAB combines the basic bi-conjugate gradient method with GMRES(1). The inclusion of the GMRES(1) steps is intended to smooth the irregular convergence behavior of the basic Bi-CG method. The Bi-CGSTAB method requires 4 vector inner products, 6 SAXPY operations, and 2 matrix-vector products for each iteration.

Matrix  Preconditioning 

Practical experience has shown that the success or failure of the GMRES and Bi-CGSTAB algorithms hinges heavily on the choice of matrix preconditioning. In left preconditioned form, the matrix problem becomes

$$P(Ap - b) = 0. \quad \text{(left preconditioned)} \qquad (2.29)$$

An alternative is the right preconditioned system

$$(AP)P^{-1}p - b = 0. \quad \text{(right preconditioned)} \qquad (2.30)$$

If available, the optimal choice of $P$ (left or right) is clearly $A^{-1}$ or its $LU$ factors. In this instance the underlying matrix problem is trivially solved in one step. More generally, we consider finding a nearby preconditioning matrix such that $\kappa(AP) < \kappa(A)$, i.e. $AP$ is better conditioned than $A$ alone.
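The right preconditioned form can be read as a change of variables. The short derivation below makes that explicit; the auxiliary vector $y$ is introduced here only for illustration and does not appear in the notes.

```latex
\begin{aligned}
y &= P^{-1} p, \\
(AP)\,y &= b, \\
p &= P\,y, \\
b - (AP)\,y &= b - A p .
\end{aligned}
```

The iterative method therefore works with the operator $AP$, and the residual it monitors is the residual of the original system. In the left preconditioned form the monitored quantity is instead $P(b - Ap)$, so a convergence tolerance has a different meaning in the two forms.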

In the present applications, we consider a preconditioning matrix based on the incomplete lower-upper (ILU) factorization of the matrix $A$. ILU
preconditioning  is  a  popular  and  robust  precon¬ 
ditioning  procedure  for  use  in  iterative  matrix 
solvers.  ILU  factorization  is  a  modification  to 
the  standard  Gaussian  elimination  for  which  the 
nonzero  fill  pattern  is  either  preimposed  or  de¬ 
termined  dynamically  based  on  the  size  or  lo¬ 
cation  of  fill  elements.  In  this  way  the  amount 
of  storage  required  can  be  specified  and  in  some 
instances  minimized.  Technical  aspects  of  ILU 
factorization  such  as  existence  and  spectral  prop¬ 
erties  have  been  proven  for  M-matrices,  but  the 
general  applicability  is  much  broader  and  well 
documented  in  the  literature.  The  triangular 
solves  required  in  the  application  of  ILU  precon¬ 
ditioning  generally  give  the  method  global  sup¬ 
port.  This  is  usually  considered  a  favorable  char¬ 
acteristic  of  the  method. 
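A minimal sketch of ILU with a preimposed fill pattern follows, assuming the simplest choice in which the pattern of the factors equals the pattern of $A$ (commonly written ILU(0)). Dense list-of-lists storage is used only to keep the example short, scalar entries stand in for blocks, and the test matrix and function names are hypothetical. The sketch also shows the two triangular solves needed each time the preconditioner is applied.

```python
def ilu0(A):
    """Gaussian elimination restricted to the nonzero pattern of A.

    Returns one matrix holding the unit lower factor L (below the diagonal)
    and the upper factor U (on and above the diagonal), plus the pattern."""
    n = len(A)
    pattern = [[A[i][j] != 0.0 for j in range(n)] for i in range(n)]
    LU = [row[:] for row in A]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                if pattern[i][j]:            # updates outside the pattern are dropped
                    LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, pattern


def apply_preconditioner(LU, r):
    """Solve (L U) z = r by a forward and a backward triangular solve."""
    n = len(r)
    y = r[:]
    for i in range(n):
        y[i] -= sum(LU[i][k] * y[k] for k in range(i))
    z = y[:]
    for i in range(n - 1, -1, -1):
        z[i] = (y[i] - sum(LU[i][k] * z[k] for k in range(i + 1, n))) / LU[i][i]
    return z


def product_LU(LU):
    """Form the dense product L U (for checking only)."""
    n = len(LU)
    prod = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else LU[i][k]
                prod[i][j] += l_ik * LU[k][j]
    return prod


def grid_matrix(m):
    """Hypothetical nonsymmetric 5-point matrix on an m x m grid."""
    n = m * m
    A = [[0.0] * n for _ in range(n)]
    for r in range(m):
        for c in range(m):
            i = r * m + c
            A[i][i] = 4.0
            if c > 0:
                A[i][i - 1] = -1.5
            if c < m - 1:
                A[i][i + 1] = -0.5
            if r > 0:
                A[i][i - m] = -1.5
            if r < m - 1:
                A[i][i + m] = -0.5
    return A


A = grid_matrix(3)
LU, pattern = ilu0(A)
prod = product_LU(LU)
```

On the retained pattern the product of the incomplete factors reproduces $A$; the difference appears only at the dropped fill positions, which is what makes the factorization an approximation rather than an exact solve.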

The finite-volume scheme with high order data reconstruction suggests two possible matrices suitable for incomplete factorization.

1. Distance-1 matrix preconditioning. Construct the preconditioning matrix from the Jacobian matrix associated with the lower (first) order accurate discretization of the flow equations. This matrix involves distance-1 neighbors in the triangulation. Matrix-vector products are computed "exactly" using the Jacobian matrix associated with the full second order accurate scheme.


2. Distance-2 matrix preconditioning. Use the Jacobian matrix of the entire second order accurate scheme for both matrix-vector products and preconditioning.

Algorithm: Preconditioned GMRES(k)

    For l = 1, m                                    m restart iterations
      r_0 := b - A p_0                              initial residual
      beta := ||r_0||_2                             initial residual norm
      v_1 := r_0 / beta                             define initial Krylov
      For j = 1, k                                  inner iterations
        y_j := P v_j                                preconditioning
        w := A y_j                                  matrix-vector prod
        For i = 1, j                                Gram-Schmidt
          h_{i,j} := (w, v_i)
          w := w - h_{i,j} v_i
        End For
        h_{j+1,j} := ||w||_2
        v_{j+1} := w / h_{j+1,j}                    define Krylov vector
      End For
      z := argmin_z ||beta e_1 - H z||_2            least squares solve
      p := p_0 + sum_{i=1}^{k} y_i z_i              approximate solution
      If ||beta e_1 - H z||_2 < epsilon exit        convergence check
      p_0 := p                                      restart
    End For

Algorithm: Preconditioned Bi-CGSTAB

    r_0 := b - A p_0                                initial residual
    rtilde := r_0
    For i = 1, m                                    m total iterations
      rho_{i-1} := rtilde^T r_{i-1}
      If rho_{i-1} = 0 (Breakdown)                  method fails
      If i = 1
        y_i := r_{i-1}
      Else
        beta_{i-1} := (rho_{i-1}/rho_{i-2})(alpha_{i-1}/omega_{i-1})
        y_i := r_{i-1} + beta_{i-1}(y_{i-1} - omega_{i-1} v_{i-1})
      Endif
      yhat := P y_i                                 preconditioning
      v_i := A yhat                                 matrix-vector prod
      alpha_i := rho_{i-1} / rtilde^T v_i
      s := r_{i-1} - alpha_i v_i
      If ||s||_2 < epsilon                          check tolerance
        p_i := p_{i-1} + alpha_i yhat
        Exit
      Endif
      shat := P s                                   preconditioning
      t := A shat                                   matrix-vector prod
      omega_i := t^T s / t^T t
      p_i := p_{i-1} + alpha_i yhat + omega_i shat
      r_i := s - omega_i t
      If ||r_i||_2 < epsilon Exit
      If omega_i = 0 (Breakdown)                    method fails
    End For

Table 2.2: GMRES and Bi-CGSTAB Algorithms for Nonsymmetric Matrices
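The restarted GMRES listing of Table 2.2 can be turned into a short runnable sketch. The version below makes several assumptions for brevity: the matrix is a dense list of lists, the small least squares problem is solved with Givens rotations (one common choice, not stated in the table), the preconditioner is passed as a function that applies $P$ to a vector, and the test matrix is hypothetical.

```python
import math


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def least_squares_hessenberg(H, beta, m):
    """Minimize ||beta e_1 - H z||_2 for the leading (m+1) x m block of H."""
    R = [row[:m] for row in H[:m + 1]]
    g = [beta] + [0.0] * m
    for j in range(m):
        denom = math.hypot(R[j][j], R[j + 1][j])
        c, s = R[j][j] / denom, R[j + 1][j] / denom
        for col in range(j, m):
            a, b_ = R[j][col], R[j + 1][col]
            R[j][col] = c * a + s * b_
            R[j + 1][col] = -s * a + c * b_
        g[j], g[j + 1] = c * g[j], -s * g[j]   # g[j+1] is zero before this rotation
    z = [0.0] * m
    for i in range(m - 1, -1, -1):
        z[i] = (g[i] - sum(R[i][q] * z[q] for q in range(i + 1, m))) / R[i][i]
    return z


def gmres_restarted(A, b, k=5, max_restarts=20, tol=1e-10, precond=None):
    """Right preconditioned GMRES(k); precond(v) applies P to v (identity if None)."""
    apply_P = precond if precond is not None else (lambda v: list(v))
    p = [0.0] * len(b)
    for _ in range(max_restarts):
        r = [bi - ai for bi, ai in zip(b, matvec(A, p))]
        beta = math.sqrt(dot(r, r))
        if beta < tol:                      # convergence check on the true residual
            break
        V = [[ri / beta for ri in r]]
        Y = []
        H = [[0.0] * k for _ in range(k + 1)]
        m = 0
        for j in range(k):
            y = apply_P(V[j])               # preconditioning
            w = matvec(A, y)                # matrix-vector product
            for i in range(j + 1):          # modified Gram-Schmidt
                H[i][j] = dot(w, V[i])
                w = [wi - H[i][j] * vi for wi, vi in zip(w, V[i])]
            H[j + 1][j] = math.sqrt(dot(w, w))
            Y.append(y)
            m = j + 1
            if H[j + 1][j] < 1e-14:         # Krylov space exhausted
                break
            V.append([wi / H[j + 1][j] for wi in w])
        z = least_squares_hessenberg(H, beta, m)
        for i in range(m):
            p = [pi + z[i] * yi for pi, yi in zip(p, Y[i])]
    return p


n = 8
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 4.0
    if i > 0:
        A[i][i - 1] = -0.5
    if i < n - 1:
        A[i][i + 1] = -1.0
b = [1.0] * n
p = gmres_restarted(A, b, k=5)
residual_norm = math.sqrt(sum((bi - ai) ** 2 for bi, ai in zip(b, matvec(A, p))))
```

Each inner iteration calls `matvec` once, so any of the matrix-vector strategies of Section 2.4.5 could be substituted for it without changing the rest of the loop.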

Performance  of  GMRES  and  Bi-CGSTAB 
for  Case  1  and  Case  2  Test  Problems 

The  test  problems  given  in  Sections  2.4.1  and 
2.4.2  provide  representative  matrices  for  evalu¬ 
ating  the  GMRES  and  Bi-CGSTAB  algorithms. 
In  evaluating  the  iterative  methods  we  construct 
approximate  Newton  matrices  corresponding  to 
flow CFL numbers of 10^ and 10®. In addition,
distance- 1  and  distance-2  preconditioning  matri¬ 
ces  are  used  to  accelerate  the  algorithms.  Figures 
2.10-2.11  graph  the  convergence  histories  for  the 
GMRES  and  Bi-CGSTAB  algorithms  in  solving 
matrix  problems  produced  from  the  inviscid  flow 
problem  given  in  Section  2.4.1.  Since  the  matrix- 
vector  products  and  preconditioning  solves  dom¬ 
inate  the  iterative  calculation,  convergence  his¬ 
tories  are  plotted  against  the  number  of  matrix- 
vector  products  required.  Each  GMRES  iter¬ 
ation  requires  one  matrix-vector  product  while 
each  Bi-CGSTAB  iteration  requires  two  prod¬ 
ucts.  The  GMRES  algorithm  is  clearly  adversely 
affected  by  the  distance- 1  preconditioning.  For 
this  case  the  distance- 1  preconditioned  system 
requires  roughly  twice  as  many  iterations  as  the 
distance-2  preconditioned  system.  These  graphs 
also  show  the  somewhat  erratic  convergence  be¬ 
havior  of  the  Bi-CGSTAB  method.  Even  so,  the 
Bi-CGSTAB  appears  to  require  a  similar  number 
of  matrix- vector  products  when  compared  to  the 
GMRES  algorithm. 

The  second  test  case  given  in  Section  2.4.2  is 
more  revealing.  Matrices  have  been  taken  from 
this  test  case  corresponding  to  CFL  numbers 
of  10®  and  10®.  Computations  show  a  definite 
degradation  in  convergence  for  both  methods  us¬ 
ing  the  distance- 1  preconditioning,  see  Figures 
2.12-2.13.  In  fact  for  CFL  =  10®,  the  conver¬ 
gence  is  unacceptably  slow.  In  general  we  find 
that  when  using  the  distance- 1  preconditioning 
matrix,  an  optimal  CFL  number  exists  for  con¬ 
vergence  and  efficiency  which  is  large  but  not 
infinite.

Figure 2.10: Case 1 (Inviscid Flow) matrix solution convergence histories for the GMRES(20) and Bi-CGSTAB algorithms at CFL = 10® using ILU(0) distance-1 and distance-2 preconditioning matrices

Figure 2.11: Case 1 (Inviscid Flow) matrix solution convergence histories for the GMRES(20) and Bi-CGSTAB algorithms at CFL = 10® using ILU(0) distance-1 and distance-2 preconditioning matrices

Figure 2.12: Case 2 (Viscous Flow) matrix solution convergence histories for the GMRES(30) and Bi-CGSTAB algorithms at CFL = 10® using ILU(0) distance-1 and distance-2 preconditioning matrices

Figure 2.13: Case 2 (Viscous Flow) matrix solution convergence histories for the GMRES(30) and Bi-CGSTAB algorithms at CFL = 10® using ILU(0) distance-1 and distance-2 preconditioning matrices

Chapter 3

Parallel Algorithms


3.1 Additive and Multiplicative Schwarz Procedures

In  this  section  we  review  the  Schwarz  theory  for 
elliptic  problems.  Beginning  with  analysis  of  the 
two  point  boundary  value  problem,  we  derive  the 
exact  theory  governing  the  alternating  Schwarz 
method  introduced  in  1869  by  Schwarz  [Sch69]. 
Next  we  consider  the  discrete  Schwarz  procedure 
and  mention  some  known  results  concerning  the 
method  for  unstructured  meshes. 

3.1.1  The  Model  Two  Point  BVP 

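Consider the two point boundary value problem on the interval $x \in [0, L]$

$$-u''(x) = f(x), \qquad u(0) = u(L) = 0 \qquad (3.1)$$

which has the solution

$$u(x) = \int_0^L g_0(x;\xi)\, f(\xi)\, d\xi \qquad (3.2)$$

in terms of the Green's function $g_0(x;\xi)$ defined on that interval. Equation (3.1) implies that for $0 < \alpha < \beta < 1$ the solution also satisfies the following Dirichlet problems:

$$-u''(x) = f(x), \quad x \in [0, \beta L], \qquad u(0) = 0, \quad u(\beta L) = u(\beta L) \qquad (3.3)$$

$$-u''(x) = f(x), \quad x \in [\alpha L, L], \qquad u(\alpha L) = u(\alpha L), \quad u(L) = 0 \qquad (3.4)$$

Let $g_1(x;\xi)$ and $g_2(x;\xi)$ denote Green's functions on the intervals $[0, \beta L]$ and $[\alpha L, L]$ respectively. Using these Green's functions we obtain the solution of (3.3) for $x \in [0, \beta L]$:

$$u(x) = \frac{x}{\beta L}\, u(\beta L) + \int_0^{\beta L} g_1(x;\xi)\, f(\xi)\, d\xi \qquad (3.5)$$

and (3.4) for $x \in [\alpha L, L]$:

$$u(x) = \frac{L - x}{(1-\alpha) L}\, u(\alpha L) + \int_{\alpha L}^{L} g_2(x;\xi)\, f(\xi)\, d\xi \qquad (3.6)$$

In the following paragraphs we consider the additive and multiplicative Schwarz algorithms which utilize (3.3) and (3.4).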

The  Additive  Schwarz  Algorithm 

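The basic idea in additive Schwarz domain decomposition is to consider the iteration

$$-(U^{k+1})''(x) = f(x), \quad 0 < x < \beta L, \qquad U^{k+1}(0) = 0, \quad U^{k+1}(\beta L) = V^k(\beta L) \qquad (3.7)$$

$$-(V^{k+1})''(x) = f(x), \quad \alpha L < x < L, \qquad V^{k+1}(\alpha L) = U^k(\alpha L), \quad V^{k+1}(L) = 0 \qquad (3.8)$$

From Equations (3.5) and (3.6) it follows that

$$U^{k+1}(x) = \frac{x}{\beta L}\, V^k(\beta L) + \int_0^{\beta L} g_1(x;\xi)\, f(\xi)\, d\xi$$

$$V^{k+1}(x) = \frac{L - x}{(1-\alpha) L}\, U^k(\alpha L) + \int_{\alpha L}^{L} g_2(x;\xi)\, f(\xi)\, d\xi$$

where the interval of validity is understood. Next define the error functions

$$d^{k+1}(x) = U^{k+1}(x) - u(x) = \frac{x}{\beta L}\, e^k(\beta L)$$

$$e^{k+1}(x) = V^{k+1}(x) - u(x) = \frac{L - x}{(1-\alpha) L}\, d^k(\alpha L). \qquad (3.9)$$

Clearly the error behaves linearly and is of the form

$$d^k(x) = \frac{x}{\beta L}\, d^k(\beta L), \qquad e^k(x) = \frac{L - x}{(1-\alpha) L}\, e^k(\alpha L). \qquad (3.10)$$

Substitution into (3.9) yields

$$d^{k+1}(x) = \frac{1-\beta}{1-\alpha}\, d^{k-1}(\alpha L)\, \frac{x}{\beta L} \qquad (3.11)$$

so that

$$d^{k+1}(\alpha L) = \frac{\alpha(1-\beta)}{(1-\alpha)\beta}\, d^{k-1}(\alpha L). \qquad (3.12)$$

From this we obtain a fundamental result in domain decomposition concerning the additive Schwarz algorithm:

Theorem 3.1 Consider the additive Schwarz iteration given by (3.7), (3.8). There exists a constant $C = C(U^0, V^0)$ such that

$$\sup_{0<x<\beta L} |U^{2k}(x) - u(x)| \le C \left(\frac{\alpha(1-\beta)}{(1-\alpha)\beta}\right)^k \qquad (3.13)$$

$$\sup_{\alpha L<x<L} |V^{2k}(x) - u(x)| \le C \left(\frac{\alpha(1-\beta)}{(1-\alpha)\beta}\right)^k. \qquad (3.14)$$

The proof follows directly from (3.12) and (3.9). ■

The Multiplicative Schwarz Algorithm

The two subdomain multiplicative Schwarz algorithm differs from the additive Schwarz algorithm only in that the subdomains are updated sequentially, i.e.

$$-(U^{k+1})''(x) = f(x), \quad 0 < x < \beta L, \qquad U^{k+1}(0) = 0, \quad U^{k+1}(\beta L) = V^k(\beta L) \qquad (3.15)$$

$$-(V^{k+1})''(x) = f(x), \quad \alpha L < x < L, \qquad V^{k+1}(\alpha L) = U^{k+1}(\alpha L), \quad V^{k+1}(L) = 0 \qquad (3.16)$$

Following a similar analysis to the previous section we obtain the following result concerning the multiplicative Schwarz algorithm:

Theorem 3.2 Consider the multiplicative Schwarz iteration given by (3.15), (3.16). There exists a constant $C = C(U^0, V^0)$ such that

$$\sup_{0<x<\beta L} |U^{k}(x) - u(x)| \le C \left[\frac{\alpha(1-\beta)}{(1-\alpha)\beta}\right]^k \qquad (3.17)$$

$$\sup_{\alpha L<x<L} |V^{k}(x) - u(x)| \le C \left[\frac{\alpha(1-\beta)}{(1-\alpha)\beta}\right]^k. \qquad (3.18)$$

Figure 3.1: Comparison of the 2 subdomain additive Schwarz (top) and multiplicative Schwarz (bottom) algorithms for the two point BVP.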

The theory clearly shows the favorable consequences of increased overlap. Figure 3.1 graphs the error functions $d(x)$ and $e(x)$ for the two domain additive and multiplicative Schwarz algorithms. As predicted by the theory, the multiplicative algorithm converges at a rate twice that of the additive algorithm. Next, consider the situation in which both subdomains are of equal size with overlap distance $\delta$. Some simple algebra yields the following simple relationship for the convergence parameter:

$$\frac{\alpha(1-\beta)}{(1-\alpha)\beta} = \left(\frac{1 - \delta/L}{1 + \delta/L}\right)^2$$

which is graphed in Figure 3.2. But as Figure 3.3 shows, the algorithm deteriorates as the number of subdomains increases. This effect will be quantified in the next section.

Figure 3.2: Convergence parameter for the two subdomain Schwarz iteration with equal subdomain size and overlap $\delta/L$.

Figure 3.3: Comparison of the 4 subdomain additive Schwarz (top) and multiplicative Schwarz (bottom) algorithms for the two point BVP.
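The two-subdomain convergence rates can be observed numerically. The sketch below is an illustration under stated assumptions: the model problem is $-u'' = 1$ on $[0, 1]$ with a standard three-point finite difference discretization, the subdomain solves use the Thomas algorithm, the interface positions are placed on grid nodes `ia` and `ib`, and all names are hypothetical. In one dimension the discrete error in each subdomain is linear in $x$, just as in the continuous analysis, so the measured reduction factors should match $\alpha(1-\beta)/((1-\alpha)\beta)$ up to round-off.

```python
def solve_dirichlet(f_values, h, left, right):
    """Thomas algorithm for (-u[i-1] + 2 u[i] - u[i+1]) / h^2 = f[i] with given end values."""
    n = len(f_values)
    rhs = [h * h * fi for fi in f_values]
    rhs[0] += left
    rhs[-1] += right
    c = [0.0] * n
    d = [0.0] * n
    c[0] = -0.5
    d[0] = rhs[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    u = [0.0] * n
    u[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        u[i] = d[i] - c[i] * u[i + 1]
    return u


def schwarz_error_history(N, ia, ib, multiplicative, sweeps=6):
    """Max error of the subdomain-1 iterate after each sweep.

    Subdomain 1 holds nodes 0..ib and subdomain 2 holds nodes ia..N, with ia < ib."""
    h = 1.0 / N
    f = [1.0] * (N + 1)
    exact = [0.0] + solve_dirichlet(f[1:N], h, 0.0, 0.0) + [0.0]
    U = [0.0] * (N + 1)  # entries beyond ib are unused
    V = [0.0] * (N + 1)  # entries before ia are unused
    history = []
    for _ in range(sweeps):
        U_new = [0.0] + solve_dirichlet(f[1:ib], h, 0.0, V[ib]) + [V[ib]]
        U_new += [0.0] * (N - ib)
        # additive: use the old U at the interface; multiplicative: use the new one
        interface = U_new[ia] if multiplicative else U[ia]
        V_tail = [interface] + solve_dirichlet(f[ia + 1:N], h, interface, 0.0) + [0.0]
        V = [0.0] * ia + V_tail
        U = U_new
        history.append(max(abs(U[i] - exact[i]) for i in range(ib + 1)))
    return history


N, ia, ib = 40, 16, 24
alpha, beta = ia / N, ib / N
rho = alpha * (1.0 - beta) / ((1.0 - alpha) * beta)
additive = schwarz_error_history(N, ia, ib, multiplicative=False)
multiplicative = schwarz_error_history(N, ia, ib, multiplicative=True)
additive_two_sweep_factor = additive[2] / additive[0]
multiplicative_one_sweep_factor = multiplicative[1] / multiplicative[0]
```

With these settings the additive iteration needs two sweeps to reduce the error by the factor that the multiplicative iteration achieves in one, which is the factor-of-two difference in rate noted in the text. Moving `ia` and `ib` further apart increases the overlap and lowers the factor.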


3.1.2  The  Discrete  Schwarz  Theory 

In this section we review the Schwarz theory for discrete systems using the notation given in Chan and Mathew [CM94]. Consider the symmetric positive definite linear system

$$Au = f \qquad (3.19)$$

obtained from the 2-D discretization of an elliptic equation on the domain $\Omega$ which consists of two overlapping subdomains $\Omega_1$ and $\Omega_2$ such that $\Omega = \Omega_1 \cup \Omega_2$ and $\Omega_1 \cap \Omega_2 \neq \emptyset$. Let $I_1$ and $I_2$ denote the set of interior mesh vertices contained in $\Omega_1$ and $\Omega_2$ respectively. The total number of mesh vertices is $n$ and the number of interior vertices in $I_1$ and $I_2$ is $n_1$ and $n_2$. Next define the zero extension matrices $R_i^T \in \mathbb{R}^{n \times n_i}$ for each subdomain such that for $x_i \in \mathbb{R}^{n_i}$

$$(R_i^T x_i)_k = \begin{cases} (x_i)_k & \text{if } k \in I_i \\ 0 & \text{if } k \notin I_i \end{cases} \qquad (3.20)$$

The local subdomain matrices are then given by $A_i = R_i A R_i^T$. The discrete form of the alternating Schwarz procedure produces the following sequence of iterates

$$u^{k+1/2} = u^k + R_1^T A_1^{-1} R_1 (f - A u^k)$$

$$u^{k+1} = u^{k+1/2} + R_2^T A_2^{-1} R_2 (f - A u^{k+1/2}) \qquad (3.21)$$

Defining the matrices

$$P_i = R_i^T A_i^{-1} R_i A \qquad (3.22)$$

the convergence is governed by the iteration matrix $(I - P_2)(I - P_1)$. This motivates the term multiplicative Schwarz iteration. Similarly, the sequence of iterates

y^k+1/2  ^  +  RjA:[^Ri(f  -  Au'^) 

yfc+i  ^  u*+i/2  +  i2^A2^P2(/- A«^) 

(3.23) 
produces the combined additive Schwarz scheme

$$u^{k+1} = u^k + (R_1^T A_1^{-1} R_1 + R_2^T A_2^{-1} R_2)(f - A u^k) = u^k + M^{-1}(f - A u^k) \qquad (3.24)$$

so that the convergence is governed by the sum $M^{-1}A = P_1 + P_2$. The additive Schwarz algorithm is appealing in parallel computation since each subdomain solve can be done in parallel. Unfortunately the performance of the algorithm deteriorates as the number of subdomains increases. This effect was observed in Figure 3.3. Let $H$ denote the characteristic size of each subdomain, $\delta$ the overlap distance, and $h$ the mesh spacing. The condition number $\kappa$ of $M^{-1}A$ is given in the following theorem:

Theorem 3.3 There exists a constant $C$ independent of $H$ and $h$ such that

(3.25)

Proof: Given in [DW89] [DW92]. ■

This theorem describes the deterioration as the number of subdomains increases (and $H$ decreases). With some additional work this deterioration can be removed by the introduction of a global, coarse subspace and a global restriction matrix $R_H$. The two level additive Schwarz matrix for $p$ subdomains becomes

$$M^{-1} = R_H^T A_H^{-1} R_H + \sum_{i=1}^{p} R_i^T A_i^{-1} R_i. \qquad (3.26)$$

Under the assumption of "generous overlap" we have the following result:

Theorem 3.4 There exists a constant $C$ independent of $H$ and $h$ such that

$$\kappa(M^{-1}A) \le C \left(1 + \frac{H}{\delta}\right). \qquad (3.27)$$

Proof: See [DW89] [DW92] and Chan and Zou [CZ93]. ■

3.1.3 Interface Substructuring

A number of algorithms exist in domain decomposition which exhibit superior conditioning to that given in Equation (3.27) while still maintaining parallel scalability. The lectures by Professor Farhat will describe the two-level Finite Element Tearing and Interconnecting (FETI) method which has

$$\kappa(M^{-1}A) \le C \left(1 + \log^2(H/h)\right) \qquad (3.28)$$

conditioning properties for self-adjoint equations on meshes with element size $h$. In this method, Lagrange multipliers are introduced to couple subdomains and ensure continuity at interface boundaries.

Figure 3.4: Arbitrary domain with subdomains 1-4 and interfaces 5-9.

Other interface strategies begin by ordering matrix unknowns in each subdomain first followed by interface unknowns as shown in Figure 3.4. This matrix ordering can be represented by


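the following partitioned matrix equation:

$$
\begin{pmatrix} A_1 & A_2 \\ A_3 & A_4 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
$$

Next consider the 2x2 inverse

$$
A^{-1} = \begin{pmatrix} C_1 & C_2 \\ C_3 & C_4 \end{pmatrix}
$$

where

$$
\begin{aligned}
C_1 &= A_1^{-1} + A_1^{-1} A_2 S^{-1} A_3 A_1^{-1} \\
C_2 &= -A_1^{-1} A_2 S^{-1} \\
C_3 &= -S^{-1} A_3 A_1^{-1} \\
S &= A_4 - A_3 A_1^{-1} A_2 \\
C_4 &= S^{-1}.
\end{aligned}
$$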
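As an illustration only, the block elimination above can be sketched in Python for a small model problem. The sketch assumes the one-dimensional matrix tridiag(-1, 2, -1) with 2m+1 unknowns, two subdomains of m unknowns each, and a single interface unknown in the middle, so that $S$ is a scalar. The names `thomas` and `schur_solve` are hypothetical and are not taken from the solver described in these notes.

```python
def thomas(rhs):
    """Solve tridiag(-1, 2, -1) x = rhs with the Thomas algorithm."""
    n = len(rhs)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0], d[0] = -0.5, rhs[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def schur_solve(m, b):
    """Solve the (2m+1)-unknown model problem by Schur complement elimination.

    Assumed ordering: x_1 = (left subdomain, right subdomain), x_2 = interface.
    """
    b_left, b_if, b_right = b[:m], b[m], b[m + 1:]
    # A_1^{-1} b_1: two decoupled subdomain solves (independent, hence parallel)
    y_left, y_right = thomas(b_left), thomas(b_right)
    # A_1^{-1} A_2: A_2 couples the interface to the two adjacent vertices
    e_left = [0.0] * m
    e_left[-1] = -1.0
    e_right = [0.0] * m
    e_right[0] = -1.0
    z_left, z_right = thomas(e_left), thomas(e_right)
    # A_3 = A_2^T, so A_3 v = -v_left[-1] - v_right[0]; A_4 = 2
    S = 2.0 - (-z_left[-1] - z_right[0])
    g = b_if - (-y_left[-1] - y_right[0])
    x_if = g / S  # interface solve with the Schur complement
    # Back substitution: x_1 = A_1^{-1} b_1 - (A_1^{-1} A_2) x_2
    x_left = [y - z * x_if for y, z in zip(y_left, z_left)]
    x_right = [y - z * x_if for y, z in zip(y_right, z_right)]
    return x_left + [x_if] + x_right
```

Calling `schur_solve(4, b)` with a right-hand side of nine entries should agree with `thomas(b)` applied to the unpartitioned system up to rounding. Only subdomain solves and one interface solve are used; the inverse is never formed.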
In practice, we do not require the explicit inverse of the matrix $A$ but rather the ability to apply $A^{-1}$ to an arbitrary vector. In this context the Schur complement strategy requires the ability to solve matrix systems for $A_1$ and the Schur complement matrix $S$. Observe that the matrix $A_1$ consists of decoupled subdomain matrices which can be solved independently (in parallel). The Schur complement matrix provides a form of global coupling of interface unknowns. In Smith [Smi92], the Schur complement form of the equations is considered together with a vertex, edge, and face space decomposition thus producing a coarse-fine space algorithm with

$$\kappa(M^{-1}A) \le C \left(1 + \log^2(H/\delta)\right) \qquad (3.29)$$

conditioning. A complete discussion of these advanced techniques is beyond the scope of these notes. Further details and references can be found in Chan and Mathew [CM94] and Tallec [Tal94].

3.1.4 Boundary Condition Admissibility and Hyperbolic Equations

Consider the model d-dimensional hyperbolic differential equation

$$A(x) \cdot \nabla u(x) = f(x), \quad x \in \mathbb{R}^d \qquad (3.30)$$

in a domain $\Omega$ with boundary $\Gamma$ and boundary normal $n$. Next subdivide $\Gamma$ into segments $\Gamma^-$ and $\Gamma^+$ associated with incoming ($A \cdot n < 0$) and outgoing ($A \cdot n > 0$) characteristics. The admissibility condition requires that

$$
\begin{aligned}
u &= u|_{\Omega}, & A \cdot n &> 0 \\
u &= u|_{\infty}, & A \cdot n &< 0
\end{aligned}
\qquad (3.31)
$$

In terms of the characteristic projectors $p^{\pm} = \frac{1}{2}[1 \pm \mathrm{sign}(A \cdot n)]$ the boundary condition becomes

$$u = p^{+} u|_{\Omega} + p^{-} u|_{\infty}. \qquad (3.32)$$

Next consider a two subdomain overlapped problem, $\Omega = \Omega_1 \cup \Omega_2$ with $\Omega_1 \cap \Omega_2 \ne \emptyset$. The hyperbolic form of the equations suggests the following multiplicative Schwarz type algorithm

$$A \cdot \nabla u_1^{k+1} = f$$

with boundary conditions

$$
\begin{aligned}
u_1^{k+1} &= p^{+} u_1^{k+1}|_{\Omega_1} + p^{-} u|_{\infty}, & &\Gamma \cap \Gamma_1 \\
u_1^{k+1} &= p^{+} u_1^{k+1}|_{\Omega_1} + p^{-} u_2^{k}, & &\Gamma_1 \cap \Omega_2
\end{aligned}
\qquad (3.33)
$$

followed by

$$A \cdot \nabla u_2^{k+1} = f$$

with boundary conditions

$$
\begin{aligned}
u_2^{k+1} &= p^{+} u_2^{k+1}|_{\Omega_2} + p^{-} u|_{\infty}, & &\Gamma \cap \Gamma_2 \\
u_2^{k+1} &= p^{+} u_2^{k+1}|_{\Omega_2} + p^{-} u_1^{k+1}, & &\Gamma_2 \cap \Omega_1.
\end{aligned}
\qquad (3.34)
$$

An additive Schwarz-type algorithm can be similarly posed [Qua90]. In the next section, we will show that numerical schemes based on upwind differencing naturally inherit the admissibility condition so that Dirichlet overlap conditions can be imposed even in the hyperbolic limit.

3.1.5 Numerical Admissibility for Hyperbolic Equations

Consider the model advection equation

$$u_t + c(x) u_x = 0, \quad 0 < x < L \qquad (3.35)$$


together with the prototypical differencing scheme

$$(u_j)_t + \frac{h_{j+1/2} - h_{j-1/2}}{\Delta x} = 0, \quad j = 0, 1, \ldots, N \qquad (3.36)$$

defined on the mesh $x_j = j \Delta x$ with $\Delta x = L/N$. Next consider the following conventional upwind flux function for the interface position located midway between $x_j$ and $x_{j+1}$

$$h_{j+1/2} = \tfrac{1}{2} c_{j+1/2} (u_j + u_{j+1}) - \tfrac{1}{2} |c_{j+1/2}| (u_{j+1} - u_j). \qquad (3.37)$$

After some manipulation, this flux can be placed in the following revealing form

$$h_{j+1/2} = c_{j+1/2} \left( p^{+}_{j+1/2} u_j + p^{-}_{j+1/2} u_{j+1} \right) \qquad (3.38)$$

with $p^{\pm}_{j+1/2} = \frac{1}{2}[1 \pm \mathrm{sign}(c_{j+1/2})]$. If we impose a Dirichlet condition on $u$ at $x = L$ for $c(L) > 0$ this would normally lead to an ill-posed hyperbolic problem. But by using (3.37) we see that numerically the ill-posed data is ignored by the scheme. More generally we make the following observation:

The use of upwind flux functions such as Roe's flux function [Roe81] permits the numerical overspecification of boundary data.

In the strong solution limit, the characteristic nature of the flux function correctly ignores all ill-posed boundary data.

This remarkable property greatly simplifies the implementation of Schwarz schemes for hyperbolic equations. The only complication that arises concerns higher order schemes in which the flux formula takes a slightly more general form

$$h_{j+1/2} = c_{j+1/2} \left( p^{+}_{j+1/2} u^{L}_{j+1/2} + p^{-}_{j+1/2} u^{R}_{j+1/2} \right) \qquad (3.39)$$

where $u^{L}$ and $u^{R}$ denote states obtained from higher order reconstruction and extrapolation. For example the extrapolation formulas

$$u^{L}_{j+1/2} = u_j + \frac{\Delta x}{2} (\delta_x u)_j, \qquad u^{R}_{j+1/2} = u_{j+1} - \frac{\Delta x}{2} (\delta_x u)_{j+1} \qquad (3.40)$$

would require having values of the solution gradient $\delta_x u$ at subdomain boundaries. A one-sided approximation could be made but would lead to an inconsistency in residuals at mesh vertices distance-1 from subdomain boundaries. The alternative is to compute numerical gradients in each subdomain followed by an exchange at subdomain boundaries as shown in Figure 3.5. When gradient data is exchanged in this way, the final solution obtained in each subdomain will be identical to a single domain computation.
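The way an upwind flux discards ill-posed boundary data can be checked with a short Python sketch. It assumes a constant wave speed, a first order scheme, and the standard central-plus-dissipation form for the upwind flux (3.37). The function names and the ghost-state arguments `u_in` and `u_out` are hypothetical choices made for this example.

```python
def sign(c):
    """Return -1, 0, or 1 according to the sign of c."""
    return (c > 0) - (c < 0)


def flux_upwind(c, u_left, u_right):
    """Assumed form of (3.37): central average minus upwind dissipation."""
    return 0.5 * c * (u_left + u_right) - 0.5 * abs(c) * (u_right - u_left)


def flux_projected(c, u_left, u_right):
    """Form (3.38): characteristic projectors p+- = (1 +- sign(c)) / 2."""
    p_plus = 0.5 * (1 + sign(c))
    p_minus = 0.5 * (1 - sign(c))
    return c * (p_plus * u_left + p_minus * u_right)


def step(u, c, dt, dx, u_in, u_out):
    """One forward-Euler step of (3.36) with constant speed c.

    u_in and u_out are Dirichlet ghost states to the left and right of the mesh.
    """
    padded = [u_in] + list(u) + [u_out]
    # h[j] is the flux on the left face of cell j, h[j + 1] on its right face
    h = [flux_projected(c, padded[j], padded[j + 1])
         for j in range(len(padded) - 1)]
    return [u[j] - dt / dx * (h[j + 1] - h[j]) for j in range(len(u))]
```

With `c > 0` the projector `p_minus` vanishes, so `step` returns the same values for any choice of `u_out`: the overspecified outflow data never enters the update.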

3.1.6  Numerical  Results  for  Additive 
and  Multiplicative  Schwarz  Iter¬ 
ation 

Figure 3.5: Strategy for exchanging boundary gradients prior to flux computation.

The following paragraphs present sample Schwarz calculations for the inviscid and viscous flow test cases given in Sections 2.4.1 and 2.4.2. In these calculations, we will consider overlap distances of 1, 2, and 3 as partially depicted in Figures 3.6 and 3.7. Note that we have not included a "coarse space" correction to the Schwarz method. Consequently, we expect to see a degradation in the method as the number of partitions increases.
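The influence of overlap on the multiplicative iteration (3.21) can be reproduced on a small model problem. The Python sketch below is not the flow solver used for the test cases. It assumes the one-dimensional matrix tridiag(-1, 2, -1), two subdomains, and a hypothetical convention in which `overlap` extends each subdomain past the midpoint by that many vertices.

```python
def thomas(rhs):
    """Solve tridiag(-1, 2, -1) x = rhs with the Thomas algorithm."""
    n = len(rhs)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0], d[0] = -0.5, rhs[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def residual(u, f):
    """Return f - A u for A = tridiag(-1, 2, -1)."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right))
    return r


def multiplicative_schwarz(f, overlap, tol=1e-8, max_iter=1000):
    """Two-subdomain multiplicative Schwarz; returns (solution, sweeps)."""
    n = len(f)
    mid = n // 2
    # Half-open index ranges of the two overlapping subdomains (assumed layout)
    blocks = [(0, mid + overlap), (mid - overlap, n)]
    u = [0.0] * n
    r0 = max(abs(v) for v in f)
    for k in range(1, max_iter + 1):
        for lo, hi in blocks:
            r = residual(u, f)
            du = thomas(r[lo:hi])        # A_i^{-1} R_i (f - A u)
            for i in range(lo, hi):
                u[i] += du[i - lo]       # u <- u + R_i^T (...)
        if max(abs(v) for v in residual(u, f)) < tol * r0:
            return u, k
    return u, max_iter
```

Running `multiplicative_schwarz([1.0] * 31, ov)` for `ov` equal to 1, 2, and 3 should show the sweep count falling as the overlap grows, which is the qualitative trend expected of the overlap study.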


Figure 3.6: A triangulated and partitioned domain exhibiting distance-1 overlap.

Test Case 1 (Inviscid Flow)

Using the mesh and flow conditions given in Section 2.4.1 inviscid flow was computed about the multi-element airfoil geometry. Figures 3.8-3.11 graph convergence histories for multiplicative and additive Schwarz iterations.

Each graph contains data for overlap distances of 1, 2, and 3 and domains partitioned into 2 and 8 subdomains. Using a variant of the uniprocessor algorithm described in Chapter 2, each subdomain problem is solved "exactly". In reality, each subdomain problem need only be solved to some reasonable level of convergence. This results in a tremendous savings in computation time. The graphs shown on the right represent computations using the second order accurate finite-volume scheme with linear reconstruction. The graphs shown on the left represent computations using a first order accurate finite-volume scheme which only requires distance-1 data on the triangulation. The basic trends show a noticeable degradation in the convergence rate with increased partitioning and a mild improvement with increased overlap.

Figure 3.8: Case 1 Inviscid Flow. Variation in Overlap For Multiplicative Schwarz with 1st order discretization.

Figure 3.10: Case 1 Inviscid Flow. Variation in Overlap For Additive Schwarz with 1st order discretization.

Figure 3.11: Case 1 Inviscid Flow. Variation in Overlap For Additive Schwarz with 2nd order discretization.

Test Case 2 (Viscous Turbulent Flow)

Using the mesh and flow conditions given previously in Section 2.4.2 viscous turbulent flow was computed about the multi-element airfoil geometry. Figure 3.12 graphs convergence histories for computations using distance 1, 2, and 3 overlap on a 4 subdomain partitioning. The improvement with increased overlap is rather slight. In Figures 3.14-3.18 the mesh and Mach number contours for a subdomain near the leading edge of the main element are shown for Schwarz iterations 1, 3, 5, 7, and 40. Note that at iteration 7 the solution visually appears quite close to its final value. Even so, the number of iterations required to reach a comfortable level of convergence is relatively large when compared to the uniprocessor Newton scheme.

Figure 3.12: Residual history for Case 2 viscous airfoil using multiplicative Schwarz iterations on 4 partitions.

Figure 3.13: Mesh partition for multi-element viscous flow computation.

Figure 3.14: Snapshot of solution isomach contours at iteration 1.

Figure 3.15: Solution isomach contour snapshots at iteration 3.

Figure 3.16: Solution isomach contour snapshots at iteration 5.

Figure 3.17: Solution isomach contour snapshots at iteration 7.

Figure 3.18: Solution isomach contour snapshots at final iteration.


3.2 Newton's Method with Schwarz Preconditioning

Next we consider using the additive Schwarz procedure to precondition the GMRES and Bi-CGSTAB methods. When used as a preconditioner, some flexibility and compromises can be made which can lead to reduced execution times:


1. Increased Sparsity Preconditioner. This is a common technique for higher order discretization methods. In the present second order finite-volume discretization, the Jacobian matrix contains distance-2 nonzero entries. For purposes of preconditioning only, the Jacobian matrix associated with a lower (first) order discretization is used.

2. Inexact Matrix Restriction. Exact matrix restriction performs the task of extracting local submatrices (a gather, scatter operation)

$$A_i = R_i A R_i^T. \qquad (3.41)$$

In some parallel implementations not all data for this calculation is processor resident. This implies communication overhead if an exact computation is to be achieved. In the next section a 3-D parallel implementation is described in which the mesh contains no overlap, yet through communication the scheme performs Schwarz preconditioning exactly equivalent to distance-2 overlap. This is sometimes referred to as "implicit" overlap. Using implicit overlap, a compromise is possible in the formation of subdomain preconditioning matrices by neglecting off-processor contributions to on-processor matrix elements. We refer to this as "inexact matrix restriction."
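The exact restriction (3.41) is simply a gather of the rows and columns belonging to one subdomain. A minimal Python sketch follows, assuming (for this example only) that the global matrix is stored as a dictionary keyed by global (row, column) index pairs; `restrict` is a hypothetical helper name.

```python
def restrict(A, local_vertices):
    """Return A_i = R_i A R_i^T as a dict keyed by local (row, col) indices.

    A maps global (row, col) pairs to nonzero values; local_vertices lists the
    global vertex numbers belonging to subdomain i, in local order.
    """
    local_index = {g: l for l, g in enumerate(local_vertices)}
    A_i = {}
    for (row, col), value in A.items():
        # Keep an entry only if both its row and column live in the subdomain
        if row in local_index and col in local_index:
            A_i[(local_index[row], local_index[col])] = value
    return A_i
```

When every entry of `A` that touches the subdomain is available on the processor, this gather needs no communication. The inexact variant described in item 2 arises when some contributions to those entries live on other processors and are neglected.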


3.2.1  Case  1  (Inviscid  flow)  Numerical 
Results. 

Using the mesh and flow conditions given in Section 2.4.1, inviscid flow is computed on a 4 subdomain mesh using the Schwarz-type preconditioning of GMRES(20) iterations. All computations were performed using a Newton matrix corresponding to a CFL number of approximately 10®. Figures 3.19 and 3.20 demonstrate the viability of using inexact matrix restriction. In addition, this figure shows the degradation in convergence due to the use of a lower order accurate preconditioning matrix and the enhancement in convergence with increased overlap. Figure 3.21 shows the mild effect of increasing the number of mesh subdomains (4, 8, 16).


Figure  3.19:  Convergence  of  GMRES(20).  Effect 
of  boundary  fill  strategies  on  4  partition  mesh 
with  unit  overlap. 
Figure 3.20: Convergence of GMRES(20). Effect
of  increasing  overlap. 


Figure  3.21:  Convergence  of  GMRES(20).  Effect 
of  increasing  number  of  partitions,  unit  overlap, 
distance  2  preconditioning 


3.2.2  Case  2  (Viscous  flow)  Numerical 
Results. 

Using the mesh and flow conditions given in Section 2.4.2, viscous turbulent flow is computed on a 4 subdomain mesh using the Schwarz-type preconditioning and GMRES(30) iterations. All computations were performed using a Newton matrix corresponding to a CFL number of 10®. Figure 3.22 shows the resulting convergence histories for the GMRES calculation using single and 4 partitions as well as distance-1 and distance-2 preconditioning matrices. The distance-2 preconditioning works very well for this problem. The distance-1 preconditioning initially reduces the matrix residual norm rapidly but then reverts to a much slower rate of convergence.


Figure  3.22:  Case  2  (viscous  flow).  Convergence 
of  GMRES(30).  Effect  of  increasing  number  of 
partitions,  unit  overlap 


3.3  Some  3-D  Computations 
on  the  IBM  SP2 

Our current platform for parallel computation at the NASA Ames Research Center is the IBM SP2 computer. The current configuration consists of 160 rack-mounted IBM 590 workstations with total memory capacity exceeding 20 gigabytes. Each processor has a peak theoretical speed of 250 megaflops. For these computations a single processor attains a sustained speed of about 55 megaflops. The processors are interconnected via a fast network switch with measured bandwidth of approximately 33 megabytes/second and a measured latency of about 45 microseconds. For maximum portability the MPI message passing protocol has been chosen for implementation of the parallel Newton algorithms.

The overall strategy in the parallel implementation is to reduce the entire algorithm to a sequence of steps requiring only distance-one information on the triangulation. This greatly simplifies the implementation of the algorithm while still replicating uniprocessor results. The implementation contains several algorithmic elements. Each of these elements will be described in the following sections and elucidated using the realistic example of fluid flow over a multiple-component wing geometry. The wing geometry, symmetry plane mesh, and Mach contours at a midspan cutting plane are shown in Figure 3.23.


Figure 3.23: Inviscid flow ($M_\infty = 0.20$, $\alpha = 0°$) over a multiple-component wing geometry (600,000 degrees of freedom).


Mesh  Partitioning 

In the 3-D parallel algorithm, the mesh is a priori partitioned into N nonoverlapping subdomains, each of which resides on one of N processors. More precisely, mesh volumes (tetrahedra, hexahedra, prisms, etc.) lie entirely on a given partition, triangulation vertices are repeated on partition boundaries, and control volumes for the finite-volume scheme span partition boundaries. The situation is depicted in Figure 3.24. In general we find that the spectral partitioning method outperforms the others but at a higher partitioning cost. Figure 3.25 shows the mesh subdivisions (bold lines) induced by a spectral partitioning on the midspan cutting plane.


Figure  3.24:  Portion  of  mesh  spanning  partition 
boundary  showing  control  volume  subdivision. 


Figure  3.25:  Mach  Contours  and  Partition 
Boundary  (bold  lines). 


Computation  of  the  Explicit  Residual  and 
Reconstruction  Gradients 

For control volumes completely contained within a single partition domain, the calculation of the residual is identical to the uniprocessor computation. For the control volumes subdivided by partition boundaries, integral conservation implies that

$$\int_{\partial\Omega} (\mathbf{f} \cdot \mathbf{n}) \, dS = \int_{\partial\Omega_0} (\mathbf{f} \cdot \mathbf{n}) \, dS + \int_{\partial\Omega_1} (\mathbf{f} \cdot \mathbf{n}) \, dS. \qquad (3.42)$$

In Figure 3.24, $\Omega_0$ and $\Omega_1$ correspond to the control volume portions on processors 0 and 1 respectively such that $\Omega = \Omega_0 \cup \Omega_1$. Therefore residuals can be computed on a processor-by-processor basis followed by an exchange and sum of residuals on interprocessor boundaries. This yields results identical to those obtained on a uniprocessor mesh. The least-squares reconstruction technique also extends in a similar way if a bit mask is assigned to all edges in the mesh so that edges lying on processor boundaries contribute only once to the accumulation formulas.
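The exchange-and-sum step can be mimicked serially in a few lines of Python. The sketch assumes a one-dimensional chain of vertices whose edges are split between two "processors", with the vertex at the cut duplicated on both. The two-point flux `edge_flux` is a placeholder, and the exchange is emulated by direct access to both local dictionaries rather than by message passing.

```python
def edge_flux(u_a, u_b):
    """Hypothetical two-point flux across the dual face of edge (a, b)."""
    return 0.5 * (u_a + u_b)


def accumulate(edges, u, vertices):
    """Accumulate edge flux contributions at the listed vertices."""
    res = {v: 0.0 for v in vertices}
    for a, b in edges:
        h = edge_flux(u[a], u[b])
        res[a] += h   # flux leaves the control volume of vertex a
        res[b] -= h   # and enters the control volume of vertex b
    return res


def partitioned_residual(u, split):
    """Residuals on two partitions of a 1-D chain, cut after `split` edges."""
    edges = [(i, i + 1) for i in range(len(u) - 1)]
    parts = [edges[:split], edges[split:]]
    local = []
    for part in parts:
        verts = sorted({v for e in part for v in e})
        local.append(accumulate(part, u, verts))
    # Exchange and sum partial residuals at duplicated boundary vertices
    shared = set(local[0]) & set(local[1])
    for v in shared:
        total = local[0][v] + local[1][v]
        local[0][v] = local[1][v] = total
    return local
```

After the exchange, each local residual agrees with the value a single-domain accumulation over all edges would give, which is the property stated above.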

The  GMRES  Algorithm 

The GMRES algorithm requires three basic operations: vector inner products, matrix-vector products, and preconditioning. The parallel implementation of each of these is described below. In our implementation of GMRES all processors solve the same small least-squares problem. This redundancy is of minor consequence.

Vector  Inner  Products 

Redundancy of boundary vertices in vector inner products is eliminated with a mask bit preassigned to each vertex. The actual inner product is calculated by a local masked inner product followed by a global summation reduction (MPI_REDUCE).
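A serial Python sketch of the masked inner product is given below. The global reduction is emulated by an ordinary sum over the per-processor partial results, and the function names are hypothetical.

```python
def masked_dot(x, y, mask):
    """Local inner product over vertices whose mask bit is set."""
    return sum(a * b for a, b, m in zip(x, y, mask) if m)


def global_dot(local_x, local_y, masks):
    """Sum the local masked products; stands in for the global reduction."""
    return sum(masked_dot(x, y, m)
               for x, y, m in zip(local_x, local_y, masks))


# Global vectors x = [1, 2, 3, 4, 5] and y = [5, 4, 3, 2, 1] split over two
# processors, with the middle vertex duplicated on both.
local_x = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
local_y = [[5.0, 4.0, 3.0], [3.0, 2.0, 1.0]]
# The duplicated vertex is counted by processor 0 only.
masks = [[1, 1, 1], [0, 1, 1]]
result = global_dot(local_x, local_y, masks)
```

Without the mask the duplicated vertex would be counted twice and the sum would be too large by its product.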

Matrix-Vector Products

Previously, we discussed several strategies for computing matrix-vector products in the uniprocessor case. If Fréchet approximate derivatives are used, then the procedure is straightforward and uses exactly the same communication steps needed in computing the explicit residual. If exact matrix-vector products are desired, we store only the matrix associated with the distance-one neighbors on the triangulation and compute the remaining terms in a matrix-free way as discussed in Chapter 2.

Processor Local ILU(0) Preconditioning

Our preconditioning matrix for the GMRES solver is based on the Jacobian matrix of the first-order accurate spatial discretization. This matrix has nonzero entries placed at distance-one locations in the connectivity graph. In a departure from the uniprocessor code, we compute and store on a processor nonzero entries in the matrix associated with mesh vertices residing on that processor. As a second step, the diagonal matrix blocks corresponding to interprocessor boundary vertices are exchanged and summed. This yields diagonal block entries in the resulting processor local matrix that are identical to the corresponding uniprocessor matrix. At the cost of increased interprocessor boundary vertex communication, all processor local matrix entries could be made identical to the uniprocessor matrix entries. The processor local matrix is ILU factored and used for preconditioning GMRES iterations. If these matrix entries are retained then a unique value must be obtained from a linear combination of the multiple computed values. Our experience has shown that the local processor preconditioning does not significantly impact the effectiveness of the ILU preconditioning. In Figure 3.26, we show the convergence of GMRES(12) with local ILU(0) preconditioning on 16, 32, and 64 processors for the multiple-component wing calculation at a CFL number of about 20000.

Keep  in  mind  that  this  departure  from  the 
uniprocessor  algorithm  only  affects  the  GMRES 
convergence  and  not  the  convergence  of  New¬ 
ton’s  method.  Figure  3.27  shows  the  conver¬ 
gence  history  of  the  Newton  scheme  for  the  first 
and  second  order  accurate  spatial  discretization 
schemes. 

3.3.1  Scalability 

Figure 3.26: GMRES Iterations (restarts) required.

Figure 3.27: Convergence History for Inviscid Multiple Component Wing Case (16 Processors)

The scalability of the current parallel algorithm on the IBM SP2, while not excellent, is certainly acceptable. This is particularly true since the parallel algorithm retains the favorable qualities of the uniprocessor algorithm, such as Newton-like convergence. Furthermore, because the parallel algorithm makes very few compromises in implementing the uniprocessor algorithm, the primary contribution to the degradation of scalability is the time taken by interprocessor communication. This implies that the scalability would be better on parallel computers with faster interprocessor communication. Figure 3.28 shows the relative speedup of computations on the IBM SP2 for 16, 32, and 64 processors, for both the first order and second order schemes. Once again, the problem being solved is the inviscid flow about a multiple component wing, as described above. The speedups are normalized by the 16 processor value, since the memory requirements made 16 processors a minimum requirement to run the problem. Table 3.1 shows the total wallclock time in minutes taken by the runs corresponding to Figure 3.28. In each case the second order scheme takes roughly 3.5 times as long as the first order scheme.

Figure 3.28: Relative speedup of parallel computations using 16, 32, and 64 processors. Each speedup is normalized by the 16 processor value.

Table 3.1: Wallclock Time in Minutes for the Multiple-Component Wing Calculation

# Procs    First Order Accurate Scheme    Second Order Accurate Scheme
16         50.0                           176.0
32         31.0                           103.0
64         15.6                           55.0
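The relative speedups plotted in Figure 3.28 follow directly from the wallclock times in Table 3.1. The short Python sketch below recomputes them, together with a parallel efficiency relative to the 16 processor run. The efficiency measure is an addition made for this example and is not a quantity reported in the notes.

```python
procs = [16, 32, 64]
minutes = {
    "first order": [50.0, 31.0, 15.6],
    "second order": [176.0, 103.0, 55.0],
}


def relative_speedup(times):
    """Speedup of each run normalized by the first (16 processor) run."""
    return [times[0] / t for t in times]


def relative_efficiency(times, procs):
    """Speedup divided by the increase in processor count."""
    speedups = relative_speedup(times)
    return [s / (p / procs[0]) for s, p in zip(speedups, procs)]


# Cost ratio of the second order scheme to the first order scheme per run
cost_ratio = [s / f for f, s in zip(minutes["first order"],
                                    minutes["second order"])]

for name, times in minutes.items():
    print(name, relative_speedup(times), relative_efficiency(times, procs))
print("second/first cost ratio:", cost_ratio)
```

Going from 16 to 64 processors both schemes speed up by a factor of about 3.2 against an ideal factor of 4, and the cost ratio stays between about 3.3 and 3.5.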
3.3.2 3-D Parallel Computation Results

In this section we will present results for a turbulent viscous computation about the multiple component wing described above. The Mach number of the run is 0.2, with a Reynolds number of 5 million. The angle of attack is 8°. For this run, 64 processors are used. To minimize storage, the Jacobian matrices are stored using single precision (32 bits on the SP2), although all floating point operations are still performed in double precision.

The tetrahedral mesh about the body has roughly 400,000 vertices and over 2,000,000 tetrahedra. Because of the need to resolve the turbulent boundary layer, the mesh is highly stretched near the wall, with cell aspect ratios of more than 10,000. The mesh on the centerline plane is seen in Figure 3.29.

Figure 3.29: Viscous turbulent flow ($M_\infty = 0.20$, $\alpha = 8°$, Re = 5 million) over a multiple-component wing. Mach contours are shown on the midspan cutting plane.

The Spalart and Allmaras turbulence model [SA92] is used to simulate the effect of turbulence on the mean flow equations. Although the basic flow equations are solved using linear reconstruction, the turbulence model equation is solved using only first order advection. This is a common procedure used to increase robustness, even in structured mesh codes. The turbulence model is fully coupled with the flow equations in computing the Jacobians. This ensures that Newton's method is approached at large timesteps even for turbulent computations.

Figure  3.30  shows  the  resulting  Mach  contours 
on  a  cutting  plane  placed  at  approximately  mid¬ 
span.  Note  the  qualitative  agreement  between 
these  results  and  those  obtained  by  the  corre¬ 
sponding  two-dimensional  computation  shown  in 
Chapter  2,  albeit  at  a  different  angle  of  attack. 


Figure 3.30: Mach Contours on the midspan cutting plane.

In Figure 3.31, contours of the eddy viscosity-like turbulence parameter $\tilde{\nu}$ defined earlier are depicted on the mid-span cutting plane. Note the high levels generated downstream of the main wing element over the aft flaps.

Presently, this computation takes about 10 minutes per step, and about 80 steps to converge to steady-state (a relatively large number for Newton's method). This is due to the slow development of turbulence over the wing. This situation is likely to improve in the near future, as we refine our technique for approaching steady-state and compute on a sequence of coarser meshes to accelerate the removal of the initial transient.

Figure 3.31: Turbulence quantity contours showing the buildup of turbulence over the aft flap.
SUMMARY

Aeroelasticity studies the mutual interaction between aerodynamic and elastic forces for an aerospace vehicle. A flexible aircraft structure immersed in a flow is subjected to surface pressures induced by that flow. If the incident flow or boundary conditions are unsteady, these pressures become time-dependent. Moreover, structural dynamic motions induced by these pressures in turn change the boundary conditions of the flow. The accurate prediction of aeroelastic phenomena such as divergence and flutter is essential in the design of high performance and safe aircraft. This prediction requires solving simultaneously the coupled fluid and structural equations of motion. Therefore, numerical aeroelastic simulations are in general resource intensive. They belong to the family of Grand Challenge engineering problems, and as such, can benefit from parallel processing technology. This paper highlights some important aspects of nonlinear computational aeroelasticity. These include a three-field arbitrary Lagrangian-Eulerian (ALE) finite element/volume formulation for coupled transient aeroelastic problems, a rigorous derivation of geometric conservation laws (GCLs) for flow problems with moving boundaries and unstructured deformable meshes, the design of a family of staggered procedures for the efficient solution of the coupled fluid/structure partial differential equations, and fast parallel domain decomposition solvers. The derivations of the GCLs are presented for ALE based finite volume formulations as well as ALE based stabilized finite element methods. The impact of these GCLs on the numerical algorithms used for time-integrating the semi-discrete equations governing the structural and fluid mesh motions is also discussed. The solution of the governing three-field equations with mixed implicit/implicit and explicit/implicit staggered procedures is analyzed with particular reference to accuracy, stability, subcycling, distributed computing, I/O transfers, and parallel processing. A general and flexible framework for implementing the partitioned analysis of coupled transient aeroelastic problems with non-matching fluid/structure interfaces on heterogeneous and/or parallel computational platforms is also described. This framework and the staggered solution procedures are demonstrated with examples ranging from the numerical investigation on an iPSC-860 massively parallel processor of the instability of flat panels with infinite aspect ratio in supersonic airstreams, to the solution on the Paragon XP/S, Cray T3D and IBM SP2 parallel systems of three-dimensional wing response problems in the transonic regime.

Paper presented in an AGARD-FDP-VKI Special Course on "Parallel Computing in CFD", held at the VKI, Rhode-Saint-Genese, Belgium, from 15-19 May 1995 and 16-20 October 1995 at NASA Ames, United States and published in R-807.

1.  INTRODUCTION 

Aeroelasticity  is  the  study  of  the  effect  of  aero¬ 
dynamic  forces  on  elastic  bodies.  Because  these 
effects  have  a  great  impact  on  performance 
and  safety  issues,  aeroelasticity  has  rapidly  be¬ 
come  one  of  the  most  important  considerations 
in  aircraft  design.  The  basic  mechanism  of  a 
fluid/structure  interaction  phenomenon  can  be 
simply  explained  as  follows.  The  aerodynamic 
forces  acting  on  an  aircraft  depend  critically  on 
the  attitude  of  its  lifting  body  with  respect  to 
the  flow,  which  in  turn  depends  on  the  flexibil¬ 
ity  of  the  aircraft.  Therefore,  the  elastic  defor¬ 
mations  of  a  structure  play  an  important  role 
in  determining  its  external  loading.  Since  the 
magnitude  of  the  aerodynamic  forces  cannot  be 
known  until  the  elastic  deformations  are  first 
determined,  it  follows  that  the  external  load 
cannot  be  evaluated  until  the  coupled  aeroelas- 
tic  problem  is  solved. 

In  general,  aeroelastic  problems  are  di¬ 
vided  into:  (a)  stability,  and  (b)  response  prob¬ 
lems.  Each  of  these  two  classes  can  be  further 
classified  into  steady-state  or  static  problems 
in  which  the  inertia  forces  may  be  neglected, 
and  unsteady,  or  dynamic,  or  transient  prob¬ 
lems  which  are  characterized  by  the  interplay 
of  all  of  the  aerodynamic,  elastic,  and  inertia 
forces.  Throughout  this  paper,  we  focus  exclu¬ 
sively  on  dynamic  aeroelasticity  problems. 

If one notes that the external aerodynamic forces acting on an aircraft structure increase rapidly with the flight speed, while the internal elastic and inertial forces remain essentially unchanged, one can easily imagine that there may exist a critical flight speed at which the structure becomes unstable. Such instability may cause excessive structural deformations and may lead to the destruction of some components of the aircraft. Panel or wing flutter, which is a sustained oscillation of panels or wings caused by the high-speed passage of air along the panel or around the wing, is an example of such instability problems. Buffeting, which is the unsteady loading of a structure by velocity fluctuations in the oncoming flow, is another important example. Because of the potentially disastrous character of these phenomena, aircraft flutter and buffeting speeds must be well outside the flight envelope. In many cases, this requirement is the determining factor in the design of wings and tail surfaces.

An aeroelastic response problem can be associated with a stability problem. For example, if a control surface of an aircraft is displaced, or turbulence in the flow is encountered, the response to be found may be the motion, the deformation, or the stress state induced in the elastic body of the aircraft. When the response of the structure to such an input is finite, the structure is stable and flutter will not occur. When the structure flutters, its response to a finite disturbance is unbounded. However, an aeroelastic response problem can also be associated with a performance rather than a stability problem. For example, it is well known that for transonic flows, small variations in incidence may lead to considerable changes in the pressure distribution, shock position, and shock strength. It is also well known that there are some margins within which the Mach number and incidence can be varied around the design condition of a supercritical airfoil without a serious deterioration of the favorably low-drag property of the shock-free flow condition [1]. Determining whether an oscillating airfoil is within or outside these margins requires determining its aeroelastic response.

Past literature on aeroelasticity is mostly devoted to linear models where the motion of a gas or a fluid past a structure, the deformation and vibration of that structure, and more importantly the interaction phenomenon itself are described with linear mathematical concepts [2,3]. Even experimental results are often interpreted by assuming a linear behavior of the physical model. However, just as swimming in a pool is a prerequisite for swimming in an ocean, understanding linear aeroelasticity problems is essential for solving nonlinear ones. Next, we summarize the linear theory of aeroelasticity.

1.1.  Linear  Theory  of  Aeroelasticity 

The fundamental assumptions behind the linear formulation and solution of transient aeroelastic problems are:

• the structure is elastic.

• it undergoes a harmonic motion with small displacement amplitudes.

• the flow can be approximated by a linearized theory.

Under the above conditions, given a free-stream Mach number $M_\infty$, the aerodynamic forces acting on an aircraft elastic structure immersed in an unsteady flow can be written as

$$F(t) = -\lambda \left( A_1 q(t) + A_2 \dot{q}(t) \right) + F_0(t) \qquad (1)$$

where $\lambda (A_1 q(t) + A_2 \dot{q}(t))$ represents the aerodynamic forces generated by the transient motion of the flexible structure, and $F_0(t)$ represents the unsteady aerodynamic forces that would have been generated if the aircraft had a rigid rather than elastic structure. Here, $t$ denotes time, $\lambda$ is the dynamic pressure, $A_1$ and $A_2$ denote the linear aerodynamic operators accounting for the surrounding flow and computed for a given $M_\infty$ and a unit dynamic pressure, and the time-dependent vector $q(t)$ represents the discretized structural displacements. Because these displacements are assumed to have small amplitudes, the governing equations of dynamic equilibrium of the aircraft elastic structure can be written as

$$M \ddot{q} + D \dot{q} + K q = -\lambda \left( A_1 q(t) + A_2 \dot{q}(t) \right) + F_0(t) \qquad (2)$$

where a dot superscript denotes a time derivative, and $M$, $D$, and $K$ are respectively the symmetric positive mass, damping, and stiffness matrices associated with the discretized structure (for example, but not necessarily, via finite elements). Eq. (2) above can be rearranged as follows

$$M \ddot{q} + (D + \lambda A_2) \dot{q} + (K + \lambda A_1) q = F_0(t) \qquad (3)$$

If the flow is steady, $F_0$ does not vary with time, and the solution of the above problem can be decomposed into steady and unsteady components

$$q(t) = q^s + q^u(t) \qquad (4)$$

where $q^s$ is the solution of

$$(K + \lambda A_1) q^s = F_0 \qquad (5)$$

and $q^u(t)$ is the solution of

$$M \ddot{q}^u + (D + \lambda A_2) \dot{q}^u + (K + \lambda A_1) q^u = 0 \qquad (6)$$

Eq. (5) is the governing equation for static aeroelasticity, where the central problem is the effect of elastic deformation on the lift distribution over lifting surfaces such as airplane wings and tails. At higher speeds, the effect of elastic deformation can become important enough to cause a wing to become unstable, to render a control surface ineffective, or even worse to reverse the sense of control. The first phenomenon is known as divergence, and the last as aileron reversal. Mathematically, the divergence speed can be obtained from the investigation of the values of $\lambda$ for which the matrix $(K + \lambda A_1)$ becomes singular. On the other hand, Eq. (6) is the governing equation of aeroelastic dynamic stability (or instability). The flutter dynamic pressure corresponds to the critical value $\lambda^{cr}$ beyond which Eq. (6) has a solution $q^u(t)$ that grows continuously in time. This value of the dynamic pressure defines the stability limit of the solution of Eq. (6). Beyond this critical value, the elastic structure will continuously extract energy from the surrounding flow and become dynamically unstable. For dynamic pressure values below $\lambda^{cr}$, the structure will release energy to the surrounding flow, which will act as a damper.
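The divergence condition stated above (a stiffness matrix augmented by the aerodynamic operator becoming singular) can be illustrated numerically. The following is a minimal sketch, not taken from this paper: the 2 x 2 matrices `K` and `A1` are hypothetical, and `lam` stands for the dynamic pressure. The values of `lam` for which `K + lam*A1` is singular follow from the eigenvalues `mu` of `inv(K) @ A1` through `lam = -1/mu`, and the smallest positive real one is taken as the divergence dynamic pressure.

```python
import numpy as np

def divergence_dynamic_pressure(K, A1, tol=1e-12):
    """Smallest positive real lam such that K + lam*A1 is singular (None if absent)."""
    # (K + lam*A1) q = 0  <=>  inv(K) A1 q = -(1/lam) q
    mu = np.linalg.eigvals(np.linalg.solve(K, A1))
    candidates = []
    for m in mu:
        # keep only nonzero, (numerically) real eigenvalues
        if abs(m) > tol and abs(m.imag) <= tol * max(1.0, abs(m)):
            lam = -1.0 / m.real
            if lam > 0.0:
                candidates.append(lam)
    return min(candidates) if candidates else None

# Hypothetical stiffness matrix and aerodynamic operator (illustration only)
K = np.array([[4.0, -1.0], [-1.0, 3.0]])
A1 = np.array([[-0.5, 0.2], [0.1, -0.25]])

lam_div = divergence_dynamic_pressure(K, A1)
```

For a large structural model one would rather use a sparse generalized eigenvalue solver and look only for the lowest critical value; the dense approach above is kept for readability.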

If the flow is unsteady, Eq. (3) becomes the governing equation for the dynamic aeroelastic response problem, and its homogeneous counterpart

$$M \ddot{q} + (D + \lambda A_2) \dot{q} + (K + \lambda A_1) q = 0 \qquad (7)$$

becomes the governing equation for the aeroelastic dynamic stability problem. Note that each of Eq. (3) and Eq. (7) represents a system of $n$ coupled second-order differential equations, where $n$ is the size of the square matrices $M$, $D$, $K$, $A_1$ and $A_2$, and is equal to the number of structural degrees of freedom (d.o.f.) introduced in the computational structural model.
For a detailed structural wing model or a complete aircraft configuration, $n$ can be as large as a hundred thousand, and therefore solving directly Eq. (3) for the aeroelastic response $q(t)$ or Eq. (7) for the flutter dynamic pressure $\lambda^{cr}$ becomes a formidable task. For this reason, Eq. (3) and/or Eq. (7) are usually projected onto an $m$-dimensional subspace ($m \ll n$) represented by its basis $\Psi_m = [\psi_1, \psi_2, \ldots, \psi_m]$. This basis is called a modal basis because each column vector $\psi_j$ is an eigenvector of the generalized symmetric eigenvalue problem

$$K \psi_j = \omega_j^2 M \psi_j \qquad (8)$$

and therefore each $\psi_j$ is a mode shape of the structure. The above generalized symmetric eigenvalue problem admits $n$ eigenpairs $(\omega_j^2, \psi_j)$, where $\omega_j$ is the circular frequency associated with the mode shape $\psi_j$. This problem arises when the conservative structural system

$$M \ddot{q} + K q = 0 \qquad (9)$$

is considered, and harmonic solutions of the form $q(t) = \psi_j e^{i \omega_j t}$ are sought. Here and throughout this section, $i$ denotes the complex number satisfying $i^2 = -1$. If the eigenvectors are mass normalized, from Eq. (8) and the symmetry properties of $M$ and $K$, it follows that

$$\Psi_m^T K \Psi_m = \Omega_m^2 = \begin{bmatrix} \omega_1^2 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \omega_m^2 \end{bmatrix}, \qquad \Psi_m^T M \Psi_m = I_m \qquad (10)$$

where the superscript $T$ designates the transpose operation, and $I_m$ denotes the $m \times m$ identity matrix. Hence, projecting $q(t)$ onto the modal basis $\Psi_m$

$$q(t) = \Psi_m y(t) \qquad (11)$$

substituting the above expression in Eq. (3), premultiplying that equation by $\Psi_m^T$, and exploiting the relationships given in Eqs. (10) leads to the modal equations of equilibrium

$$I_m \ddot{y} + (D_m + \lambda A_{2m}) \dot{y} + (\Omega_m^2 + \lambda A_{1m}) y = F_{0m}(t) \qquad (12)$$

where

$$D_m = \Psi_m^T D \Psi_m, \qquad A_{1m} = \Psi_m^T A_1 \Psi_m, \qquad A_{2m} = \Psi_m^T A_2 \Psi_m, \qquad F_{0m}(t) = \Psi_m^T F_0(t) \qquad (13)$$

and $y(t)$ is known as the vector of generalized or modal coordinates. If the so-called Rayleigh structural damping is used ($D = aM + bK$, $a > 0$, $b > 0$), or a modal damping is assumed for the structure, $D_m$ also becomes a diagonal matrix. However, $A_{1m}$ and $A_{2m}$ are in general $m \times m$ full matrices.

In summary, even though projecting Eq. (3) and/or Eq. (7) onto the modal basis $\Psi_m$ does not completely uncouple the $n$ second-order differential equations because of the presence of the aerodynamic operators $A_1$ and $A_2$, this procedure is still attractive because it reduces the number of coupled ordinary differential equations to be solved from $n$ to $m \ll n$.
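The modal projection summarized above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used by the author: the 4-d.o.f. matrices `M`, `K`, `D` and the full matrix `A1` are hypothetical, and the mass-normalized modes are obtained with a Cholesky transformation so that only NumPy is needed.

```python
import numpy as np

def mass_normalized_modes(K, M, m):
    """First m eigenpairs of K psi = omega^2 M psi, scaled so that Psi^T M Psi = I."""
    L = np.linalg.cholesky(M)                   # M = L L^T
    Linv = np.linalg.inv(L)
    w2, Z = np.linalg.eigh(Linv @ K @ Linv.T)   # standard symmetric problem, ascending
    Psi = Linv.T @ Z[:, :m]                     # back-transform the eigenvectors
    return w2[:m], Psi

n, m = 4, 2
M = np.diag([2.0, 1.0, 1.0, 3.0])
K = 100.0 * (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
D = 0.1 * M + 0.002 * K                         # Rayleigh damping, a = 0.1, b = 0.002
A1 = np.arange(16.0).reshape(n, n) / 10.0       # hypothetical full aerodynamic operator

w2, Psi = mass_normalized_modes(K, M, m)
Km = Psi.T @ K @ Psi      # diag(omega_1^2, ..., omega_m^2)
Mm = Psi.T @ M @ Psi      # identity
Dm = Psi.T @ D @ Psi      # diagonal because D is a Rayleigh damping matrix
A1m = Psi.T @ A1 @ Psi    # in general a full m x m matrix
```

The projected stiffness, mass, and Rayleigh damping matrices come out diagonal, while the projected aerodynamic matrix stays full, which is why the reduced equations remain coupled.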

If an aeroelastic response problem is investigated, Eq. (12) is usually solved for $y(t)$ using a numerical time integration algorithm. Then, the structural displacement field $q(t)$ is recovered by making use of Eq. (11). However, it should be noted that Eq. (11) can also be written as

$$q(t) = \Psi_m y(t) = \sum_{r=1}^{m \ll n} \psi_r y_r(t) \qquad (14)$$

which highlights the fact that $q(t)$ is a truncated modal solution of the original Eq. (3). Aside from time discretization errors, the accuracy of such a solution depends on the importance of the contributions of the truncated mode shapes (eigenvectors) to the exact response of the structure. In other words, it depends on the load distribution of the aircraft and the frequency content of the aeroelastic response of the structure. For wing flutter problems, the behavior of the structure is often dominated by low frequency dynamics, and therefore is well represented by the first few modes. In that case, only the first few eigenvectors $\psi_j$ are usually kept in the modal basis $\Psi_m$, and the truncated modal superposition method delivers an accurate solution of the dynamic aeroelastic response problem.
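As an illustration of the time integration step mentioned above, the following sketch advances a small modal system of the form y'' + C y' + G y = f(t). The source does not say which integrator is used at this point; the trapezoidal rule (Newmark with beta = 1/4, gamma = 1/2) is an assumption made here for the example, and the two undamped modes are hypothetical.

```python
import numpy as np

def newmark_modal(C, G, f, y0, v0, dt, nsteps):
    """Integrate  y'' + C y' + G y = f(t)  with the trapezoidal rule
    (Newmark, beta = 1/4, gamma = 1/2). Returns the history of y."""
    m = len(y0)
    y, v = np.array(y0, float), np.array(v0, float)
    a = f(0.0) - C @ v - G @ y                        # initial acceleration
    S = np.eye(m) + 0.5 * dt * C + 0.25 * dt**2 * G   # effective matrix
    hist = [y.copy()]
    for k in range(1, nsteps + 1):
        y_pred = y + dt * v + 0.25 * dt**2 * a        # predictors
        v_pred = v + 0.5 * dt * a
        a = np.linalg.solve(S, f(k * dt) - C @ v_pred - G @ y_pred)
        y = y_pred + 0.25 * dt**2 * a                 # correctors
        v = v_pred + 0.5 * dt * a
        hist.append(y.copy())
    return np.array(hist)

# Two hypothetical undamped modes with periods 1 and 1/3, free vibration
omega = np.array([2.0 * np.pi, 6.0 * np.pi])
hist = newmark_modal(np.zeros((2, 2)), np.diag(omega**2),
                     lambda t: np.zeros(2), [1.0, 0.5], [0.0, 0.0], 1e-3, 1000)
```

In an aeroelastic setting, `C` and `G` would hold the damping-like and stiffness-like modal matrices including the aerodynamic contributions, and the physical displacements would then be recovered by multiplying the modal basis by each row of `hist`.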

On the other hand, if an aeroelastic dynamic stability problem is investigated, the homogeneous form of Eq. (12) is solved for the flutter dynamic pressure $\lambda^{cr}$. One methodology for obtaining $\lambda^{cr}$ goes as follows. Let $V_\infty$ denote the free-stream velocity (flight speed), and $\rho_\infty$ the free-stream air density. We have

$$\lambda = \frac{1}{2} \rho_\infty V_\infty^2 \qquad (15)$$

When the structure undergoes a harmonic motion characterized by a circular frequency $\omega$, the linear aerodynamic operators $A_1$ and $A_2$ become a function of the reduced frequency $k$

$$A_1 = A_1(k), \qquad A_2 = A_2(k), \qquad k = \frac{\omega}{V_\infty} \qquad (16)$$

If structural damping is neglected, seeking a solution of the homogeneous form of Eq. (12) of the form

$$y(t) = y e^{i \bar{\omega} t}, \qquad \bar{\omega} = \omega (1 + i \alpha), \qquad |\alpha| \ll 1 \qquad (17)$$

leads to

$$\left[ -\bar{\omega}^2 I_m + \Omega_m^2 + \frac{1}{2} \rho_\infty V_\infty^2 A_m\!\left(\frac{\bar{\omega}}{V_\infty}\right) \right] y = 0, \qquad A_m = A_{1m}\!\left(\frac{\bar{\omega}}{V_\infty}\right) + i \bar{\omega} A_{2m}\!\left(\frac{\bar{\omega}}{V_\infty}\right) \qquad (18)$$

Note that the first of Eqs. (17) can be rearranged as

$$y(t) = y e^{i \omega t} e^{-\alpha \omega t} \qquad (19)$$

which shows that the homogeneous form of Eq. (12) will have a stable solution if and only if all of the solutions $\bar{\omega}$ of Eq. (18) correspond to $\alpha > 0$. Therefore, $\alpha = 0$ is the stability limit, and the sought-after flutter dynamic pressure $\lambda^{cr}$ corresponds to the critical value $V_\infty^{cr}$ of the flight speed, or the critical value $k^{cr} = \omega / V_\infty^{cr}$ of the reduced frequency, for which Eq. (18) admits a real solution $\bar{\omega} = \omega (1 + i \times 0) = \omega$.

From the second of Eqs. (17), it follows that

$$\frac{1}{\bar{\omega}^2} \approx \frac{1}{\omega^2} - i \frac{2 \alpha}{\omega^2} \qquad (20)$$
Hence, substituting Eq. (20) into Eq. (18), making use of the third of Eqs. (16), and exploiting the assumption $|\alpha| \ll 1$ finally gives

$$Z_m(k) \, y = \left( \frac{1}{\omega^2} - i \frac{2 \alpha}{\omega^2} \right) y, \qquad Z_m(k) = \Omega_m^{-2} \left( I_m - \frac{\rho_\infty}{2 k^2} A_m(k) \right) \qquad (21)$$

which shows that $(1 - i 2 \alpha)/\omega^2$ is a complex eigenvalue of a matrix $Z_m$ that, for a fixed free-stream air density $\rho_\infty$, depends only on the reduced frequency $k$. Therefore, the flutter dynamic pressure $\lambda^{cr} = \rho_\infty (V_\infty^{cr})^2 / 2$ can be found by sweeping over the values of $k$, and solving for each $k$ the eigenvalue problem (21). Among all possible critical values $k^{cr}$ of the reduced frequency for which a real eigenvalue $1/\omega^2$ is found (and therefore for which $\alpha$ vanishes), the flutter speed is given by the smallest value $V_\infty^{cr}$, and the flutter dynamic pressure by the corresponding $\lambda^{cr} = \rho_\infty (V_\infty^{cr})^2 / 2$. This procedure is known as the “k” method, or the “k-sweeping” method. It is implemented in many industrial codes (see, for example, [4]). It is accurate when the structure is less than 10% damped. When the structure has a higher percentage of damping, other methods such as the “p-k” method [5] can be used for finding the flutter dynamic pressure $\lambda^{cr}$. Such methods are in general more expensive than the “k” method and are beyond the scope of this paper.
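The sweeping procedure described above can be sketched as follows. This is a toy illustration, not the author's code: `Z_toy` is a hypothetical 2-mode stand-in for the matrix whose eigenvalues have the form (1 - 2i*alpha)/omega^2 (a real one would be assembled from aerodynamic operators computed, for example, by a doublet-lattice code), the reduced frequency is assumed to be k = omega/V with any reference length absorbed in k, and the air density 1.225 kg/m^3 is an assumed sea-level value. The routine returns the first crossing of alpha = 0 met along the sweep; choosing the smallest flutter speed among several crossings is left out.

```python
import numpy as np

def k_sweep(Z_of_k, k_values):
    """Return (k_cr, omega_cr) at the first sign change of the smallest alpha,
    or None if no crossing is found on the sweep."""
    prev = None
    for k in k_values:
        mu = np.linalg.eigvals(Z_of_k(k))        # mu = (1 - 2i*alpha)/omega^2
        alpha = -mu.imag / (2.0 * mu.real)
        j = np.argmin(alpha)                     # least damped root
        cur = (k, alpha[j], 1.0 / np.sqrt(mu[j].real))
        if prev is not None and prev[1] * cur[1] <= 0.0:
            # linear interpolation of alpha(k) = 0 between the two samples
            s = prev[1] / (prev[1] - cur[1])
            return (prev[0] + s * (cur[0] - prev[0]),
                    prev[2] + s * (cur[2] - prev[2]))
        prev = cur
    return None

# Hypothetical 2-mode matrix: mode 1 (omega = 5) loses damping at k = 0.43,
# mode 2 (omega = 10) stays damped; T mixes the modes so Z is not diagonal.
T = np.array([[1.0, 0.5], [0.2, 1.0]])
Tinv = np.linalg.inv(T)

def Z_toy(k):
    mu1 = (1.0 - 1j * (k - 0.43)) / 25.0
    mu2 = (1.0 - 0.3j) / 100.0
    return T @ np.diag([mu1, mu2]) @ Tinv

k_cr, omega_cr = k_sweep(Z_toy, np.linspace(1.0, 0.1, 19))
V_cr = omega_cr / k_cr          # assumes k = omega / V
lam_cr = 0.5 * 1.225 * V_cr**2  # assumed air density of 1.225 kg/m^3
```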

At this point, the reader should recall that the linear aerodynamic operators $A_1$ and $A_2$ are computed for a specified free-stream Mach number, and therefore $V_\infty^{cr}$ is also computed for a specified $M_\infty$ (and a specified free-stream air density $\rho_\infty$). This implies that for each value of $M_\infty$, there exists a critical free-stream speed of sound $c_\infty^{cr} = V_\infty^{cr} / M_\infty$, and that a curve $c_\infty^{cr} = c_\infty^{cr}(M_\infty)$ can be determined. The intersection of this curve with the horizontal line $c_\infty = 320$ m/s gives the critical free-stream Mach number $M_\infty^{cr}$.
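The intersection described above can be located by simple interpolation once the critical speed has been computed at a few Mach numbers. The sketch below is only an illustration: the sampled flutter speeds are hypothetical numbers, and piecewise-linear interpolation between samples is an assumption of this example.

```python
import numpy as np

def critical_mach(mach, c_cr, c_air=320.0):
    """Mach number where the sampled curve c_cr(M) crosses c_air,
    by linear interpolation; None if there is no crossing."""
    g = np.asarray(c_cr, float) - c_air
    for i in range(len(g) - 1):
        if g[i] == 0.0:
            return float(mach[i])
        if g[i] * g[i + 1] < 0.0:
            s = g[i] / (g[i] - g[i + 1])
            return float(mach[i] + s * (mach[i + 1] - mach[i]))
    return float(mach[-1]) if g[-1] == 0.0 else None

# Hypothetical flutter speeds (m/s) obtained from a sweep at each Mach number
mach = np.array([0.6, 0.7, 0.8, 0.9])
V_cr = np.array([240.0, 245.0, 248.0, 252.0])
c_cr = V_cr / mach               # critical free-stream speed of sound
M_cr = critical_mach(mach, c_cr)
```

With these made-up samples the critical speeds of sound are 400, 350, 310, and 280 m/s, so the crossing with 320 m/s falls between Mach 0.7 and 0.8.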


So far, the derivation of the linear aerodynamic operators $A_1$ and $A_2$ has not been discussed. It has only been stated that a linearized flow theory and a harmonic motion of the structure with small displacement amplitudes are assumed. More precisely, $A = A_1 + i \omega A_2$ can be computed using the doublet-lattice method [6] in the subsonic regime, and the potential gradient method [7], the harmonic gradient method [8], or piston theory [2] in the supersonic regime. In all cases, the flow is assumed to be inviscid, irrotational, and isentropic. In the transonic regime, the mixed subsonic-supersonic flow patterns and shock waves are such that there are no reliable theoretical means for predicting the unsteady aerodynamic forces. In that case, the linear aeroelasticity theory simply breaks down. This is most unfortunate because of the current renewed interest in transonic flight for both military (F-16) and civil aircraft.

Besides transonic flight, there are many other important cases where the linear aeroelastic theory cannot be used for predicting the dynamic response or stability of an aircraft. These include, to name only a few, problems where the structure undergoes large displacements and/or rotations (as an example, we note that the maximum upward deflection of the wing of the B52 bomber is 22 feet [2]), parachute dynamics, bluff body oscillators, airfoil oscillations in separated flow, buffeting, and high-G and high angle of attack maneuvers such as those performed by the X-31 aircraft. Some of these and related problems are discussed in [9], where emphasis is placed on the fundamental understanding of the nonlinear theory of interaction; others are still unresolved. The pressing need for solving and understanding all of these problems is the main motivation for designing a reliable nonlinear transient aeroelastic numerical simulation capability.
1.2. Formulation of Coupled Nonlinear Aeroelastic Problems

Here, the structure is no longer restricted to a harmonic motion with small displacement amplitudes. In principle, there is also no reason to confine its constitutive modeling to that of an elastic material. However, while aircraft structures can undergo large displacements and rotations, they seldom experience large strains. Therefore, the nonlinear modeling of the structural behavior can be limited to the proper accounting of nonlinear geometric effects without a serious loss of generality.

More importantly, the aerodynamic forces acting on the structure are no longer predicted here by the use of a linear aerodynamic operator because of the important limitations associated with such an approach and discussed at the end of Section 1.1. Rather, these unsteady forces are determined from the solution of the compressible Euler equations when viscous effects are neglected, and from the solution of the compressible Navier-Stokes equations otherwise. Furthermore, no restriction is imposed on the nature of the fluid/structure coupling, at least in principle. This coupling is numerically modeled by suitable fluid/structure interface boundary conditions. Clearly, this means that the methodology described here for simulating nonlinear transient aeroelastic problems is based on the simultaneous solution of the governing nonlinear fluid and structure equations, and as such, is computationally intensive and can benefit from parallel processing.

One difficulty in handling the fluid/structure coupling numerically stems from the fact that the structural equations are usually formulated with material (Lagrangian) co-ordinates, while the fluid equations are typically written using spatial (Eulerian) co-ordinates. Therefore, a straightforward approach to the solution of the coupled fluid/structure dynamic equations requires moving at each time-step at least the portions of the fluid grid that are close to the moving structure. This can be appropriate for small displacements of the structure but may lead to severe grid distortions when the structure undergoes large motion. Several different approaches have emerged as an alternative to partial regridding in transient aeroelastic computations, among which we note the arbitrary Lagrangian/Eulerian (ALE) formulation [10-12], the co-rotational approach [13,14], dynamic meshes [15] which are closely related to the ALE concept, interpolation based methods [16], and space-time formulations [17]. All of these approaches treat a computational aeroelastic problem as a coupled two-field problem.

However, a moving mesh (Fig. 1) can also be viewed as a pseudo-structural system with its own dynamics [18], and therefore, the coupled transient aeroelastic problem can be formulated as a three- rather than two-field problem: the fluid, the structure, and the dynamic mesh (Fig. 2). The semi-discrete equations governing this three-way coupled problem can be written as follows:

$$\frac{\partial}{\partial t} \left( V(x,t) \, W(t) \right) + F^c(W(t), x, \dot{x}) = R(W(t), x)$$

$$M \frac{\partial^2 q}{\partial t^2} + f^{int}(q) = f^{ext}(W(t), x) \qquad (22)$$

$$\tilde{M} \frac{\partial^2 x}{\partial t^2} + \tilde{D} \frac{\partial x}{\partial t} + \tilde{K} x = K_c \, q$$

where $x$ is the displacement or position (depending on the context of the sentence) of a moving fluid grid point, $W$ is the fluid state vector, $V$ results from the finite element/volume discretization of the fluid equations, $F^c$ is the vector of convective ALE fluxes that depend on the fluid grid velocity, $R$ is the vector of diffusive fluxes, $q$ is as before the structural displacement vector, $f^{int}$ denotes the vector of internal structural forces that is equal to $Kq$ in the linear case, $f^{ext}$ the vector of external forces acting on the structure, $M$ is the finite element mass matrix of the structure, $\tilde{M}$, $\tilde{D}$, and $\tilde{K}$ are fictitious mass, damping, and stiffness matrices associated with the fluid moving grid (Fig. 3) and constructed to avoid any parasitic interaction between the fluid and its grid, or the structure and the moving fluid grid [18], and $K_c$ is a transfer matrix that describes the action of the motion of the structural side of the fluid/structure interface on the fluid dynamic mesh [19]. For example, $\tilde{M} = \tilde{D} = 0$ and $\tilde{K} = R$, where $R$ is a rotation matrix, corresponds to a rigid mesh motion of the fluid grid around an oscillating airfoil, and $\tilde{M} = \tilde{D} = 0$ includes as particular cases the spring-based mesh motion scheme introduced in [15] and the continuum based updating strategy advocated by several investigators (see, for example, [17]).


Fig.  1.  Moving  and  deforming  fluid  grid 


Fig. 3. A pseudo-structural tetrahedron in a fluid mesh


The first of Eqs. (22) is derived in detail in Section 2. The second of Eqs. (22) is the standard nonlinear structural dynamics equation of equilibrium. The notation $f^{ext}(W(t), x)$ is used to remind the reader that the external forces acting on the structure include, among others, the aerodynamic forces that are computed from the knowledge of the fluid state vector $W$ and the motion and deformation of the surface of the structure, which in turn controls the motion $x(t)$ of the fluid grid. Hence, Eqs. (22) are fully coupled.


Fig. 2. Three-field formulation (labels in the figure: computational domain; fluid, far field)


1.3.  Objectives  and  outline  of  this  paper 

Each of the three components of the three-way coupled problem described by Eqs. (22) has different mathematical and numerical properties, and distinct software implementation requirements. For Euler and Navier-Stokes flows, the fluid equations are nonlinear. The structural equations and the semi-discrete equations governing the pseudo-structural fluid grid system may be linear or nonlinear. The matrices resulting from a linearization procedure are in general symmetric for the structural problem, but they are typically unsymmetric for the fluid problem. Moreover, the nature of the coupling in Eqs. (22) is implicit rather than explicit, even when the fluid mesh motion is ignored. The fluid and the structure interact only at their interface, via the pressure and viscous forces, and the motion of the physical interface. However, for Euler and Navier-Stokes compressible flows, the pressure variable cannot be easily isolated from either the fluid equations or the fluid state vector $W$. Consequently, the numerical solution of Eqs. (22) via a fully coupled monolithic scheme is computationally challenging and, from a software standpoint, unmanageable.

Alternatively, Eqs. (22) can be solved via a partitioned analysis or a staggered procedure [20-23]. This approach offers several appealing features including the ability to use well established discretization and solution methods within each discipline, simplification of software development efforts, and preservation of software modularity.

Traditionally, nonlinear transient aeroelastic problems have been solved via the simplest possible partitioned analysis whose cycle can be described as follows: a) advance the structural system under a given pressure load, b) update the fluid mesh accordingly, and c) advance the fluid system and compute a new pressure load [15,16,24-27]. Occasionally, some investigators have advocated the introduction of a few predictor/corrector iterations within each cycle of this three-step staggered integrator in order to improve accuracy [28], especially when the fluid equations are nonlinear and treated implicitly [29]. However, more efficient staggered solution procedures can and should be devised.
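The three-step cycle a), b), c) described above can be made concrete with a deliberately small toy problem. Everything in the sketch below is an assumption made for illustration and none of it is the solver of this paper: the structure is a single spring-mass d.o.f. advanced with the trapezoidal rule, the mesh update simply makes the interface node follow the structure, and the "fluid" is a hypothetical first-order relaxation of the pressure toward a piston-theory-like value proportional to the wall velocity.

```python
def staggered_cycle(q, v, p, x_old, dt, m=1.0, k=4.0, area=1.0,
                    rho_c=0.2, tau=0.05):
    """One cycle: a) structure under given pressure, b) mesh update, c) fluid."""
    # a) advance the structure (trapezoidal rule) with the pressure load frozen
    f = -p * area
    a0 = (f - k * q) / m
    a1 = (f - k * (q + dt * v + 0.25 * dt**2 * a0)) / (m + 0.25 * dt**2 * k)
    q_new = q + dt * v + 0.25 * dt**2 * (a0 + a1)
    v_new = v + 0.5 * dt * (a0 + a1)
    # b) update the fluid mesh: the interface node follows the structure
    x_new = q_new
    w = (x_new - x_old) / dt                 # mesh (wall) velocity
    # c) advance the toy fluid state and return the new pressure load
    p_new = p + dt * (rho_c * w - p) / tau   # explicit Euler relaxation
    return q_new, v_new, p_new, x_new

# Release the structure from q = 1 and record its mechanical energy
q, v, p, x = 1.0, 0.0, 0.0, 1.0
dt, energy = 0.01, []
for _ in range(2000):
    q, v, p, x = staggered_cycle(q, v, p, x, dt)
    energy.append(0.5 * v**2 + 0.5 * 4.0 * q**2)
```

In this toy setting the pressure acts as a damper, so the recorded structural energy decays over the run. The pressure used in step a) always lags the structural motion by one cycle, which is the kind of time lag that predictor/corrector iterations are meant to reduce.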

The main objective of this paper is to present a computational framework for the massively parallel solution of the three-way coupled Eqs. (22) that is being developed at the University of Colorado by the author and his co-workers. This is certainly not to imply that we are the only research group working on this problem. However, we believe that our computational framework includes many innovative ideas and unique capabilities that are worth discussing. For this purpose, the remainder of this paper is organized as follows.

At the heart of nonlinear transient aeroelastic simulations is the computation of unsteady flow problems with moving boundary conditions and dynamic unstructured meshes. In this paper, we do not discuss the state-of-the-art of unsteady flow solvers, especially since their status seems to be far from satisfactory [30]. For this specific topic, we refer the reader to references [30,31]. However, we focus in Section 2 on the important issue of geometric conservation laws (GCLs) which, in the presence of dynamic meshes, impose important constraints on the algorithms employed for time-integrating the semi-discrete equations governing the fluid and dynamic mesh motions. In particular, we address the problem of satisfying both displacement and velocity continuity constraints between the structure and fluid mesh motions at the fluid/structure interface, and the impact of this problem on the accuracy and stability of the time-integrator selected for predicting the aeroelastic structural response. In Section 3, we present a broad family of staggered solution procedures where the fluid flow can be integrated using either an implicit or an explicit scheme, and the structural response is advanced using an implicit one. We address important issues pertaining to numerical stability, subcycling, accuracy vs. speed trade-offs, implementation on heterogeneous computing platforms, and inter-field as well as intra-field parallel processing. Next, we describe in Section 4 our particular two- and three-dimensional unsteady flow solvers. In Section 5, we discuss the solution of the structural dynamics equations. Because our goal is to handle linear as well as nonlinear structural dynamics problems, we opt for a direct time integration method rather than the restrictive modal superposition approach. We describe a substructure based nonlinear implicit time integration algorithm that features second-order accuracy and unconditional stability. For scalability purposes, we also adopt as a linearized solver the substructure based preconditioned conjugate gradient FETI method [32,33] equipped with the projection scheme presented in [34] for solving iteratively and efficiently systems with repeated right hand sides. In general, the fluid and structure meshes have two independent representations of the physical fluid/structure interface, and do not necessarily match at that interface. We discuss this and other related issues in Section 6 where we also describe “Matcher” [35], a program for generating in parallel the data structures needed for handling arbitrary and non-conforming fluid/structure interfaces in aeroelastic computations. In Section 7, we turn to the solution of the equations governing the dynamic motion of the fluid grid. In Section 8, we describe a unified and portable approach for parallel fluid/structure computations that is based on the mesh partitioning paradigm. We also briefly discuss the controversial topic of what constitutes a good mesh partition for parallel processing. In Section 9, we illustrate our framework for computational dynamic aeroelasticity with examples ranging from the numerical investigation on an iPSC-860 massively parallel processor of the instability of flat panels with infinite aspect ratio in supersonic airstreams, to the solution on the Paragon XP/S, Cray T3D and IBM SP2 parallel systems of three-dimensional wing response problems in the transonic regime. Finally, we conclude this paper in Section 10.

REMARK 1: Some of the content of this paper is based on recent publications by the author and his co-workers. These publications are indicated between [ ] at the beginning of each section and wherever appropriate.


2. GEOMETRIC CONSERVATION LAWS [19,36]


As stated earlier, the matrices $\tilde{K}$ and $K_c$ that appear in the third of Eqs. (22) are designed to enforce continuity between the grid motion and the structural displacement and/or velocity at the moving fluid/structure boundary $\Gamma_{F/S}(t)$

$$x(t) = q(t) \ \ \text{on} \ \ \Gamma_{F/S}(t), \qquad \dot{x}(t) = \dot{q}(t) \ \ \text{on} \ \ \Gamma_{F/S}(t) \qquad (23)$$


The  first  of  Eqs.  (22)  involves  both  the  po¬ 
sition  and  velocity  of  the  underlying  fluid  dy¬ 
namic  mesh.  These  entities  are  usually  ob¬ 
tained  from  the  solution  of  the  second  and  third 
of  Eqs.  (22),  and  optionally  from  the  use  of  a 
predictor.  When  selecting  a  method  for  inte¬ 
grating  the  fluid  equations,  it  is  desirable  to 
choose  one  that  preserves  the  trivial  solution 
of  a  uniform  flow  field  (  in  the  absence  of  other 
boundary  conditions,  a  uniform  flow  field  is  a 
solution  of  the  Navier-Stokes  equations).  In 
this  section,  we  show  that  this  property  is  veri¬ 
fied  only  when  the  numerical  scheme  chosen  for 
solving  the  fluid  equations  and  the  algorithm 
constructed  for  updating  the  mesh  position  and 
velocity  satisfy  a  certain  condition.  We  refer  to 
this  condition  as  the  Geometric  Conservation 
Law  (GCL)  because:  (a)  it  can  be  identified 
as  integrating  exactly  the  area  or  volume  swept 
by  the  boundary  of  a  cell  in  a  finite  volume 
formulation,  and  (b)  its  principle  is  similar  to 
the  GCL  condition  that  was  first  pointed  out 
in  [37]  for  structured  grids  and  finite  difference 
schemes.  In  the  present  work,  we  derive  the 
conditions  imposed  by  the  GCL  in  terms  of  an 
appropriate  choice  of  integration  points  in  time, 
and  a  consistent  scheme  for  updating  the  grid 
point  velocities.  This  is  in  contrast  with  pre¬ 
vious  works  [38,39]  where  the  GCL  was  ad¬ 
dressed  in  terms  of  averaged  normal  or  velocity 
coefficients for moving finite volume cells. The
approach presented herein for deriving and satisfying
a GCL is deemed more general than those
previously  discussed  in  the  literature.  For  ex¬ 
ample,  it  recovers  the  results  of  the  normal  av¬ 
eraging  algorithm  recently  proposed  in  [38]  for 
finite  volume  discretizations,  and  applies  as  well 
to  finite  element  methods  that  are  not  covered 
by  this  normal  averaging  procedure. 

Throughout this section, we consider flow
computations using unstructured moving meshes.
We  focus  on  the  Euler  equations,  because  in  our 
formulation  the  viscous  terms  are  not  explicitly 
affected  by  the  mesh  motion.  We  derive  several 
GCL  conditions  for  these  problems,  and  dis¬ 
cuss  their  various  algorithmic  implications.  We 
consider  first  the  case  where  the  finite  volume 
method  is  chosen  for  the  spatial  approximation 
of  the  flow  equations,  and  the  ALE  formulation 
is  used  for  handling  dynamic  meshes.  Then, 
we  analyze  the  cases  where  the  finite  element 
method  is  employed  for  spatial  discretization, 
and  the  moving  mesh  is  treated  with  either  a 
space-time  or  an  ALE  formulation,  respectively. 
In  particular,  we  show  that  space-time  finite  el¬ 
ement  methods  always  satisfy  the  fundamen¬ 
tal  geometric  conservation  law.  We  investigate 
the  consequences  of  the  GCL  condition  on  the 
temporal  integration  of  the  structural  equations 
of  motion.  Most  importantly,  we  address  the 
problem  of  satisfying  both  displacement  and  ve¬ 
locity  continuity  equations  between  the  struc¬ 
ture  and  fluid  mesh  at  the  fluid/structure  in¬ 
terface,  without  violating  the  GCL.  Finally,  we 
highlight  the  importance  of  the  GCL  with  an 
illustration  of  its  effect  on  the  computation  of 
the  transient  aeroelastic  response  of  a  flat  panel 
in  transonic  flow. 

2.1. The Finite Volume Method
with an ALE Formulation

Let $\Omega(t) \subset \mathbb{R}^n$ ($n = 2, 3$) be the flow domain of interest and $\Gamma(t)$ be its moving and deforming boundary. We introduce a mapping function between $\Omega(t)$, where time is denoted by $t$ and the grid point coordinates by $x$, and a reference configuration $\Omega(0)$, where time is denoted by $\tau$ and the grid point coordinates by $\xi$, as follows

$$x = x(\xi, \tau); \qquad t = \tau \tag{24}$$

The conservative form of the equations describing Euler flows can be written in arbitrary Lagrangian-Eulerian (ALE) form as

$$\frac{\partial (J W)}{\partial t}\bigg|_{\xi} + J \, \nabla_x \cdot \mathcal{F}^c(W, \dot{x}) = 0, \qquad \mathcal{F}^c(W, \dot{x}) = \mathcal{F}(W) - \dot{x} W \tag{25}$$

where $J = \det(dx/d\xi)$ is the Jacobian of the frame transformation $\xi \to x$, $W$ denotes the fluid conservative variables, $\mathcal{F}^c$ denotes the convective ALE fluxes, and $\dot{x} = \partial x / \partial \tau |_{\xi}$ is the ALE grid velocity that may be different from the fluid velocity and from zero.

The  finite  volume  method  for  unstructured 
meshes  relies  on  the  discretization  of  the  com¬ 
putational  domain  into  control  volumes  or  cells 
$C_i$ constructed around the vertices $S_i$, with
boundaries denoted by $\partial C_i$, and normals to
these boundaries denoted by $\nu_i$.


Fig.  4.  Control  volume 
(unstructured  two-dimensional  mesh) 

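Eq. (25) can then be integrated over the control cells. In an ALE formulation, these cells move and deform in time. First, integration is performed over a reference cell in the $\xi$ space as follows

$$\int_{C_i(0)} \frac{\partial (J W)}{\partial t}\bigg|_{\xi} \, d\Omega_\xi + \int_{C_i(0)} J \, \nabla_x \cdot \mathcal{F}^c(W, \dot{x}) \, d\Omega_\xi = 0 \tag{26}$$

In the above equation, the partial time derivative is evaluated at constant $\xi$; hence, it can be moved outside of the integral sign to obtain

$$\frac{d}{dt} \int_{C_i(0)} W J \, d\Omega_\xi + \int_{C_i(0)} \nabla_x \cdot \mathcal{F}^c(W, \dot{x}) \, J \, d\Omega_\xi = 0 \tag{27}$$

Switching back to the time-varying cells, Eq. (27) above can be rewritten as

$$\frac{d}{dt} \int_{C_i(t)} W \, d\Omega_x + \int_{C_i(t)} \nabla_x \cdot \mathcal{F}^c(W, \dot{x}) \, d\Omega_x = 0 \tag{28}$$

Finally, integrating by parts the last term yields the governing integral equation

$$\frac{d}{dt} \int_{C_i(t)} W \, d\Omega_x + \int_{\partial C_i(t)} \mathcal{F}^c(W, \dot{x}) \cdot \nu_i \, d\sigma = 0 \tag{29}$$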

In a finite volume method, the flux through the cell boundary $\partial C_i(t)$ is usually evaluated via a flux splitting approximation [40] as follows

$$F_i^c(\mathbf{W}, \mathbf{x}, \dot{\mathbf{x}}) = \sum_j \int_{\partial C_{ij}(\mathbf{x})} \left( \mathcal{F}^c_+(W_i, \dot{x}) + \mathcal{F}^c_-(W_j, \dot{x}) \right) \cdot \nu_i \, d\sigma \tag{30}$$

where $\partial C_{ij}$ is the intersection between the boundaries of cells $C_i$ and $C_j$, $W_i$ denotes the average value of $W$ over the cell $C_i$, $\mathbf{W}$ is the vector formed by the collection of $W_i$, and $\mathbf{x}$ is the vector of the time-dependent grid point positions. The numerical flux functions $\mathcal{F}^c_+$ and $\mathcal{F}^c_-$ are designed to make the resulting system stable. An example of such functions can be found in [41]. For consistency, these numerical fluxes must verify

$$\mathcal{F}^c_+(W, \dot{x}) + \mathcal{F}^c_-(W, \dot{x}) = \mathcal{F}^c(W, \dot{x}) \tag{31}$$

Thus, the resulting discrete equation is

$$\frac{d}{dt}(V_i W_i) + F_i^c(\mathbf{W}, \mathbf{x}, \dot{\mathbf{x}}) = 0 \tag{32}$$

where

$$V_i = \int_{C_i(t)} d\Omega_x \tag{33}$$

is the area for two-dimensional flow problems, and the volume for three-dimensional flow problems, of cell $C_i$. Collecting all Eqs. (32) into a single system yields

$$\frac{d}{dt}(\mathbf{V} \mathbf{W}) + \mathbf{F}^c(\mathbf{W}, \mathbf{x}, \dot{\mathbf{x}}) = 0 \tag{34}$$

where $\mathbf{V}$ is the diagonal matrix of the cell areas, $\mathbf{W}$ is the vector containing all state variables $W_i$, and $\mathbf{F}^c$ is the collection of the fluxes $F_i^c$. This also completes the derivation of the first of Eqs. (22).

2.1.1. The Geometric Conservation Law


Let $\Delta t$ and $t^n = n \Delta t$ denote respectively the chosen time-step and the n-th time-station. Integrating Eq. (32) between $t^n$ and $t^{n+1}$ leads to

$$V_i(\mathbf{x}^{n+1}) W_i^{n+1} - V_i(\mathbf{x}^n) W_i^n + \int_{t^n}^{t^{n+1}} F_i^c(\mathbf{W}, \mathbf{x}, \dot{\mathbf{x}}) \, dt = 0 \tag{35}$$

The most important issue in the solution of the first of Eqs. (22) via an ALE method is the proper evaluation of $\int_{t^n}^{t^{n+1}} F_i^c(\mathbf{W}, \mathbf{x}, \dot{\mathbf{x}}) \, dt$ in Eq. (35). In particular, it is crucial to establish where the fluxes must be integrated: on the mesh configuration at $t = t^n$ ($\mathbf{x}^n$), on that at $t = t^{n+1}$ ($\mathbf{x}^{n+1}$), or in between these two configurations. The same questions arise as to the choice of the mesh velocity vector $\dot{\mathbf{x}}$.

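Clearly, a proposed numerical algorithm for computing the quantity $\int_{t^n}^{t^{n+1}} F_i^c(\mathbf{W}, \mathbf{x}, \dot{\mathbf{x}}) \, dt$ involving general and arbitrary time dependent fluid state vectors and mesh configurations cannot be acceptable unless it conserves the state of a uniform flow. Let $W^*$ denote a given uniform state of the flow. Substituting $W_i^n = W_i^{n+1} = W^*$ in Eq. (35) gives

$$(V_i^{n+1} - V_i^n) W^* + \int_{t^n}^{t^{n+1}} F_i^c(\mathbf{W}^*, \mathbf{x}, \dot{\mathbf{x}}) \, dt = 0 \tag{36}$$

where $\mathbf{W}^*$ is the vector of the state variables when $W_k = W^*$ for all $k$. From Eq. (30), it follows that

$$F_i^c(\mathbf{W}^*, \mathbf{x}, \dot{\mathbf{x}}) = \sum_j \int_{\partial C_{ij}(\mathbf{x})} \left( \mathcal{F}^c_+(W^*, \dot{x}) + \mathcal{F}^c_-(W^*, \dot{x}) \right) \cdot \nu_i \, d\sigma = \int_{\partial C_i(\mathbf{x})} \left( \mathcal{F}(W^*) - \dot{x} W^* \right) \cdot \nu_i \, d\sigma \tag{37}$$

Given that the integral on a closed boundary of the flux of a constant function is identically zero

$$\int_{\partial C_i(\mathbf{x})} \mathcal{F}(W^*) \cdot \nu_i \, d\sigma = 0 \tag{38}$$

it follows that

$$F_i^c(\mathbf{W}^*, \mathbf{x}, \dot{\mathbf{x}}) = - \int_{\partial C_i(\mathbf{x})} \dot{x} W^* \cdot \nu_i \, d\sigma \tag{39}$$

Hence, substituting Eq. (39) into Eq. (36) yields

$$\left( V_i(\mathbf{x}^{n+1}) - V_i(\mathbf{x}^n) \right) W^* - \left( \int_{t^n}^{t^{n+1}} \int_{\partial C_i(\mathbf{x})} \dot{x} \cdot \nu_i \, d\sigma \, dt \right) W^* = 0 \tag{40}$$

which can be rewritten as

$$V_i(\mathbf{x}^{n+1}) - V_i(\mathbf{x}^n) = \int_{t^n}^{t^{n+1}} \int_{\partial C_i(\mathbf{x})} \dot{x} \cdot \nu_i \, d\sigma \, dt \tag{41}$$

Eq. (41) above defines a geometric conservation law (GCL) that must be verified by any proposed ALE mesh updating scheme. This law states that the change in area (volume) of each control volume between $t^n$ and $t^{n+1}$ must be equal to the area (volume) swept by the cell boundary during $\Delta t = t^{n+1} - t^n$. Therefore, the updating of $\mathbf{x}$ and $\dot{\mathbf{x}}$ cannot be based on mesh distortion issues alone when using ALE solution schemes.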

The  assumption  that  the  numerical  method 
performs  exactly  the  integration  of  Eq.  (38)  is 
referred  to  in  [39]  as  the  Surface  Conservation 
Law (SCL). Satisfying this condition is necessary
for flow computations on static meshes
and  is  not  specific  to  dynamic  ones.  Therefore, 
we  do  not  discuss  this  condition  in  this  section 
any  further  and  refer  the  reader  to  [39]  for  ad¬ 
ditional  details. 
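
The role of the GCL can be illustrated on a one-dimensional moving mesh. The sketch below is not taken from the paper: it assumes a linear flux F(W) = a W, an explicit update of Eq. (35), and a hypothetical mesh motion x(xi, t) = xi + 0.05 sin(2 pi xi) sin(t). It advances a uniform state W* by one time-step twice: once with the interface velocity (x^{n+1} - x^n)/dt, which integrates the swept length exactly, and once with the analytic velocity sampled at t^{n+1}, which does not.

```python
import numpy as np

def ale_step_uniform(x_old, x_new, dt, w_star, a, mesh_velocity):
    """One explicit 1D ALE finite volume step applied to a uniform state.

    x_old, x_new  : cell interface positions at t^n and t^{n+1}
    mesh_velocity : interface velocities used in the ALE flux F(W) - xdot * W
    Returns the updated cell averages W_i^{n+1}.
    """
    vol_old = np.diff(x_old)                      # cell lengths at t^n
    vol_new = np.diff(x_new)                      # cell lengths at t^{n+1}
    ale_flux = (a - mesh_velocity) * w_star       # ALE flux at each interface
    net_outflow = ale_flux[1:] - ale_flux[:-1]    # right minus left, per cell
    return (vol_old * w_star - dt * net_outflow) / vol_new

# Hypothetical mesh motion used only for this illustration.
xi = np.linspace(0.0, 1.0, 11)
dt, w_star, a = 0.5, 1.0, 2.0
x_old = xi.copy()                                           # t^n = 0
x_new = xi + 0.05 * np.sin(2.0 * np.pi * xi) * np.sin(dt)   # t^{n+1} = dt

v_gcl = (x_new - x_old) / dt                                # GCL-consistent velocity
v_end = 0.05 * np.sin(2.0 * np.pi * xi) * np.cos(dt)        # velocity sampled at t^{n+1}

w_gcl = ale_step_uniform(x_old, x_new, dt, w_star, a, v_gcl)
w_end = ale_step_uniform(x_old, x_new, dt, w_star, a, v_end)

err_gcl = np.max(np.abs(w_gcl - w_star))
err_end = np.max(np.abs(w_end - w_star))
print("deviation from uniform state, GCL velocity      :", err_gcl)
print("deviation from uniform state, end-point velocity:", err_end)
```

With the first choice the uniform state is preserved to round-off; with the second, the cells gain or lose mass although nothing physical is happening, which is the kind of spurious source the GCL is meant to rule out.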


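2.1.2. Implications of the GCL

From the analysis presented in the previous section, it follows that an appropriate scheme for evaluating $\int_{t^n}^{t^{n+1}} F_i^c(\mathbf{W}^*, \mathbf{x}, \dot{\mathbf{x}}) \, dt$ in Eq. (36) is a scheme that respects the GCL (41). Note that once a mesh updating scheme is given, the left hand side of Eq. (41) is always exactly computed. Hence, a proper method for evaluating $\int_{t^n}^{t^{n+1}} F_i^c(\mathbf{W}^*, \mathbf{x}, \dot{\mathbf{x}}) \, dt$ is a method that obeys the GCL and therefore computes exactly the right hand side of Eq. (41), that is,

$$\int_{t^n}^{t^{n+1}} \int_{\partial C_i(\mathbf{x})} \dot{x} \cdot \nu_i \, d\sigma \, dt$$

2.1.3. The Two-Dimensional Case

Given that in two dimensions $\partial C_i$ is the union of segments, it suffices to consider the integration of $\dot{x} \cdot n$ along a segment $[ab]$ with a normal $n$

$$I_{[ab]} = \int_{t^n}^{t^{n+1}} \int_{[ab]} \dot{x} \cdot n \, d\sigma \, dt \tag{42}$$

Let $x_a$ and $x_b$ denote the instantaneous positions of two connected vertices $a$ and $b$ (Fig. 5). The position of any point on the edge $[ab]$ during the time-interval $[t^n, t^{n+1}]$ can be parametrized as follows

$$x(t) = \alpha x_a(t) + (1 - \alpha) x_b(t), \qquad \dot{x}(t) = \alpha \dot{x}_a(t) + (1 - \alpha) \dot{x}_b(t), \qquad \alpha \in [0, 1], \quad t \in [t^n, t^{n+1}] \tag{43}$$

where

$$x_a(t) = \delta(t) x_a^{n+1} + (1 - \delta(t)) x_a^n, \qquad x_b(t) = \delta(t) x_b^{n+1} + (1 - \delta(t)) x_b^n \tag{44}$$

and $\delta(t)$ is a real function that satisfies

$$\delta(t^n) = 0; \qquad \delta(t^{n+1}) = 1 \tag{45}$$

Fig. 5. Parametrization of an edge
in a two-dimensional space

Substituting Eqs. (43,44) into Eq. (42) yields

$$I_{[ab]} = \int_{t^n}^{t^{n+1}} \int_0^1 \left( \alpha \dot{x}_a + (1 - \alpha) \dot{x}_b \right) \cdot n \, l \, d\alpha \, dt = \int_{t^n}^{t^{n+1}} \frac{1}{2} (\dot{x}_a + \dot{x}_b) \cdot n \, l \, dt = \int_{t^n}^{t^{n+1}} \frac{1}{2} (\dot{x}_a + \dot{x}_b) \cdot H (x_a - x_b) \, dt \tag{46}$$

where $l$ is the length of edge $[ab]$, and $H = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$. The mesh velocities $\dot{x}_a$ and $\dot{x}_b$ can be obtained from the differentiation of Eq. (44)

$$\dot{x}_a = \dot{\delta}(t) (x_a^{n+1} - x_a^n), \qquad \dot{x}_b = \dot{\delta}(t) (x_b^{n+1} - x_b^n) \tag{47}$$

and $I_{[ab]}$ can be finally written as

$$I_{[ab]} = \frac{1}{2} \int_{t^n}^{t^{n+1}} \dot{\delta}(t) \left( (x_a^{n+1} - x_a^n) + (x_b^{n+1} - x_b^n) \right) \cdot H \left( \delta(t) (x_a^{n+1} - x_b^{n+1}) + (1 - \delta(t)) (x_a^n - x_b^n) \right) dt$$

$$= \frac{1}{2} \left( (x_a^{n+1} - x_a^n) + (x_b^{n+1} - x_b^n) \right) \cdot H \int_0^1 \left( \delta (x_a^{n+1} - x_b^{n+1}) + (1 - \delta) (x_a^n - x_b^n) \right) d\delta \tag{48}$$

Clearly, the integrand of $I_{[ab]}$ is linear in $\delta$. Therefore, $I_{[ab]}$ can be exactly computed using the midpoint rule, provided that Eq. (47) holds, that is,

$$\dot{x} = \dot{\delta}(t) (x^{n+1} - x^n) = \frac{\Delta \delta}{\Delta t} (x^{n+1} - x^n) \tag{49}$$

which in view of Eq. (45) can also be written

$$\dot{x} = \frac{x^{n+1} - x^n}{\Delta t} \tag{50}$$

In summary, the GCL derived herein shows that for two-dimensional problems, the integrand of $\int_{t^n}^{t^{n+1}} F_i^c(\mathbf{W}, \mathbf{x}, \dot{\mathbf{x}}) \, dt$ in Eq. (35) must be evaluated at the midpoint configuration, and that this integral must be computed as follows

$$\int_{t^n}^{t^{n+1}} F_i^c(\mathbf{W}, \mathbf{x}, \dot{\mathbf{x}}) \, dt = \Delta t \, F_i^c(\mathbf{W}^k, \mathbf{x}^{n+\frac{1}{2}}, \dot{\mathbf{x}}^{n+\frac{1}{2}})$$

$$\mathbf{W}^{n+\frac{1}{2}} = \frac{\mathbf{W}^n + \mathbf{W}^{n+1}}{2}, \qquad \mathbf{x}^{n+\frac{1}{2}} = \frac{\mathbf{x}^n + \mathbf{x}^{n+1}}{2}, \qquad \dot{\mathbf{x}}^{n+\frac{1}{2}} = \frac{\mathbf{x}^{n+1} - \mathbf{x}^n}{\Delta t} \tag{51}$$

where the superscript $k$ depends on the time
discretization of the flow equation.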
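
The two-dimensional result can be checked numerically on a single polygonal cell. The following sketch is an illustration only: the pentagon, its nodal displacements and the helper names are invented for the example, and H is taken as the 90-degree rotation matrix with the edge vertices ordered so that H(x_a - x_b) points outward for a counter-clockwise polygon (the orientation convention is an assumption here). It sums the edge contributions of Eq. (48) on a configuration located at a fraction theta of the time-step and compares the result with the exact change in cell area.

```python
import numpy as np

# Rotation by 90 degrees; with a = next vertex and b = current vertex of a
# counter-clockwise polygon, H @ (x_a - x_b) is the outward normal times the
# edge length (orientation convention assumed for this example).
H = np.array([[0.0, 1.0], [-1.0, 0.0]])

def polygon_area(pts):
    """Signed area of a polygon (shoelace formula), positive if counter-clockwise."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y)

def swept_area(p_old, p_new, theta):
    """Sum over edges of 0.5 * (dx_a + dx_b) . H (x_a - x_b), with the edge
    geometry taken on the configuration (1 - theta) * p_old + theta * p_new."""
    p = (1.0 - theta) * p_old + theta * p_new
    d = p_new - p_old                      # nodal displacements over the step
    total = 0.0
    n = len(p)
    for k in range(n):
        a, b = (k + 1) % n, k
        total += 0.5 * (d[a] + d[b]) @ (H @ (p[a] - p[b]))
    return total

# Hypothetical cell and motion (counter-clockwise pentagon).
p_old = np.array([[0.0, 0.0], [1.0, 0.0], [1.2, 0.8], [0.5, 1.3], [-0.2, 0.7]])
disp = np.array([[0.05, -0.02], [0.10, 0.03], [0.20, 0.10], [-0.05, 0.15], [-0.10, 0.0]])
p_new = p_old + disp

area_change = polygon_area(p_new) - polygon_area(p_old)
swept_mid = swept_area(p_old, p_new, 0.5)   # midpoint configuration
swept_old = swept_area(p_old, p_new, 0.0)   # configuration at t^n

print("area change           :", area_change)
print("midpoint-rule estimate:", swept_mid)
print("t^n-based estimate    :", swept_old)
```

Only the midpoint configuration reproduces the area change to round-off, which is the discrete statement of Eq. (41) for this cell.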

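2.1.4. The Three-Dimensional Case

In a three-dimensional space, the boundary of each cell is polygonal and can be decomposed into a set of non overlapping triangular facets. Similarly to the two-dimensional case, let $I_{[abc]}$ denote the flux crossing the facet $[abc]$

$$I_{[abc]} = \int_{t^n}^{t^{n+1}} \iint_{[abc]} \dot{x} \cdot n \, d\sigma \, dt \tag{52}$$

Let $x_a$, $x_b$ and $x_c$ denote the instantaneous positions of three connected vertices $a$, $b$ and $c$. The position of any point on the facet can be parametrized as follows (see Fig. 6)

$$x = \alpha_1 x_a(t) + \alpha_2 x_b(t) + (1 - \alpha_1 - \alpha_2) x_c(t)$$

$$\dot{x} = \alpha_1 \dot{x}_a(t) + \alpha_2 \dot{x}_b(t) + (1 - \alpha_1 - \alpha_2) \dot{x}_c(t)$$

$$\alpha_1 \in [0, 1]; \quad \alpha_2 \in [0, 1 - \alpha_1]; \quad t \in [t^n, t^{n+1}] \tag{53}$$

where

$$x_a(t) = \delta(t) x_a^{n+1} + (1 - \delta(t)) x_a^n$$

$$x_b(t) = \delta(t) x_b^{n+1} + (1 - \delta(t)) x_b^n \tag{54}$$

$$x_c(t) = \delta(t) x_c^{n+1} + (1 - \delta(t)) x_c^n$$

and $\delta(t)$ is given in (45).

Fig. 6. Parametrization of a facet
in a three-dimensional space

Substituting the above parametrization in (52) we obtain

$$I_{[abc]} = \int_{t^n}^{t^{n+1}} \int_0^1 \int_0^{1 - \alpha_1} \left( \alpha_1 \dot{x}_a + \alpha_2 \dot{x}_b + (1 - \alpha_1 - \alpha_2) \dot{x}_c \right) \cdot n \, |x_{ac} \wedge x_{bc}| \, d\alpha_2 \, d\alpha_1 \, dt$$

$$= \int_{t^n}^{t^{n+1}} \frac{1}{6} (\dot{x}_a + \dot{x}_b + \dot{x}_c) \cdot (x_{ac} \wedge x_{bc}) \, dt = \int_{t^n}^{t^{n+1}} \frac{1}{6} \dot{\delta} (\Delta x_a + \Delta x_b + \Delta x_c) \cdot (x_{ac} \wedge x_{bc}) \, dt$$

$$= \int_0^1 \frac{1}{6} (\Delta x_a + \Delta x_b + \Delta x_c) \cdot (x_{ac} \wedge x_{bc}) \, d\delta \tag{55}$$

where

$$x_{ac} = x_a - x_c; \qquad x_{bc} = x_b - x_c$$

$$\Delta x_a = x_a^{n+1} - x_a^n; \qquad \Delta x_b = x_b^{n+1} - x_b^n; \qquad \Delta x_c = x_c^{n+1} - x_c^n \tag{56}$$

Noting that

$$x_{ac} \wedge x_{bc} = \left( \delta x_{ac}^{n+1} + (1 - \delta) x_{ac}^n \right) \wedge \left( \delta x_{bc}^{n+1} + (1 - \delta) x_{bc}^n \right) \tag{57}$$

is a quadratic function of $\delta$, the integrand of $I_{[abc]}$ is clearly quadratic in $\delta$ and therefore can be exactly computed using a 2-point integration rule, provided that Eq. (50) is used to compute $\dot{x}$.

Hence, the proper method for evaluating $\int_{t^n}^{t^{n+1}} F_i^c(\mathbf{W}, \mathbf{x}, \dot{\mathbf{x}}) \, dt$ that respects the GCL (41) in the three-dimensional case is

$$\int_{t^n}^{t^{n+1}} F_i^c(\mathbf{W}, \mathbf{x}, \dot{\mathbf{x}}) \, dt = \frac{\Delta t}{2} \left( F_i^c(\mathbf{W}^{k1}, \mathbf{x}^{m1}, \dot{\mathbf{x}}^{n+\frac{1}{2}}) + F_i^c(\mathbf{W}^{k2}, \mathbf{x}^{m2}, \dot{\mathbf{x}}^{n+\frac{1}{2}}) \right)$$

$$m1 = n + \frac{1}{2} + \frac{1}{2\sqrt{3}}; \qquad m2 = n + \frac{1}{2} - \frac{1}{2\sqrt{3}}$$

$$\mathbf{W}^{n+\eta} = \eta \mathbf{W}^{n+1} + (1 - \eta) \mathbf{W}^n; \qquad \mathbf{x}^{n+\eta} = \eta \mathbf{x}^{n+1} + (1 - \eta) \mathbf{x}^n; \qquad \dot{\mathbf{x}}^{n+\frac{1}{2}} = \frac{\mathbf{x}^{n+1} - \mathbf{x}^n}{\Delta t} \tag{58}$$

where the superscripts $k1$ and $k2$ depend on the
time discretization of the flow equation.

2.1.5. Recovery of the Averaged-Normals Method

In [38], the convected flux across the facet $[abc]$ is computed using

$$I_{[abc]} = \frac{1}{3} (\Delta x_a + \Delta x_b + \Delta x_c) \cdot \boldsymbol{\eta}$$

$$\boldsymbol{\eta} = \frac{1}{6} \left( x_{ac}^n \wedge x_{bc}^n + x_{ac}^{n+1} \wedge x_{bc}^{n+1} \right) + \frac{1}{12} \left( x_{ac}^n \wedge x_{bc}^{n+1} + x_{ac}^{n+1} \wedge x_{bc}^n \right) \tag{59}$$

while the evaluation of Eq. (52) using the two-point rule gives

$$I_{[abc]} = \frac{1}{3} (\Delta x_a + \Delta x_b + \Delta x_c) \cdot \boldsymbol{\eta}^*$$

$$\boldsymbol{\eta}^* = \frac{1}{4} \left( x_{ac}^{m1} \wedge x_{bc}^{m1} + x_{ac}^{m2} \wedge x_{bc}^{m2} \right) \tag{60}$$

Expanding $\boldsymbol{\eta}^*$, we obtain

$$\boldsymbol{\eta}^* = \frac{1}{6} \left( x_{ac}^n \wedge x_{bc}^n + x_{ac}^{n+1} \wedge x_{bc}^{n+1} \right) + \frac{1}{12} \left( x_{ac}^n \wedge x_{bc}^{n+1} + x_{ac}^{n+1} \wedge x_{bc}^n \right) \tag{61}$$

which shows that the proposed GCL (20) recovers the same results as the averaged-normals method proposed in [38] for the finite volume discretization of flow equations with moving meshes.

2.2. The Stabilized Finite Element
Method with a Space-Time Formulation

2.2.1. Semi-Discretization
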
Time-integration in space-time finite element methods is derived in a different manner than what has been presented so far. Space-time finite element methods contain the time integration formula in the chosen shape functions. These methods are basically weighted residual formulations that perform an integration in space and time of the product of the Euler equations and an appropriate weighting function. Stabilization is usually required for the spatial approximation [42]. In this section, we focus on the stabilized Least-Square/Galerkin method and time-discontinuous shape functions.

Let $0 = t^0 < t^1 < \ldots < t^N = T$ be a partition of the time-interval $I = ]0, T[$, and $I_n$ be the subinterval $]t^n, t^{n+1}[$. A space-time slab is defined in $I_n \times \mathbb{R}^d$, where $d$ designates the spatial dimension, as follows

$$Q_n = \{ (t, \Omega(t)) \mid t \in I_n \} \tag{62}$$

with boundary

$$P_n = \{ (t, \Gamma(t)) \mid t \in I_n \} \tag{63}$$

For each space-time slab, the spatial domain is subdivided into $n_{el}$ elements $\Omega_e(t)$, $e = 1, \ldots, n_{el}$ (see Fig. 7). The following notational convention is adopted

$$W(t^{\pm}) = \lim_{\epsilon \to 0^{\pm}} W(t + \epsilon) \tag{64}$$

Fig. 7. Space-time slabs

Given some finite element spaces $S_n^h$ and $V_n^h$, the space-time (discontinuous) Least-Square/Galerkin method for solving the Euler flow equations goes as follows

Find $W^h \in S_n^h$ such that for all $V^h \in V_n^h$

$$\int_{Q_n} (\ldots) \, dQ + \int_{\Omega(t^n)} V^h(t^n_+) \left( W^h(t^n_+) - W^h(t^n_-) \right) d\Omega + \sum_{e=1}^{n_{el}} \int_{Q_n^e} (\mathcal{L} W^h) \, \tau \, (\mathcal{L} V^h) \, dQ = \int_{P_n} V^h \mathcal{F}_i n_i \, dP \tag{65}$$

where $\mathcal{L}$ is the differential operator associated with the Euler equations and $\tau$ is a stabilization parameter.

2.2.2. The Geometric Conservation Law

Provided that the spatial integration scheme computes exactly the following quantities

$$\int_{Q} dQ, \qquad \int \nu_i \, d\sigma = 0$$

it follows that $W^h = W^*$ is always a solution of Eq. (65). Hence, a space-time stabilized finite element method always satisfies the GCL. This is certainly an advantage. However, space-time finite element methods are rather computationally expensive.

2.3. The Stabilized Finite Element
Method with an ALE Formulation

2.3.1. Semi-Discretization

The stabilized finite element method with an ALE formulation can be derived by multiplying Eq. (25) by a weighting function $V^h(\xi)$, integrating over $\Omega(0)$, and adding a consistent stabilization term $S(V^h, W)$ to obtain

$$\int_{\Omega(0)} V^h \frac{\partial (J W)}{\partial t}\bigg|_{\xi} \, d\Omega_\xi + \int_{\Omega(0)} V^h J \, \nabla_x \cdot \mathcal{F}^c(W, \dot{x}) \, d\Omega_\xi + S(V^h, W) = 0 \tag{67}$$

For example, $S(V^h, W)$ can be selected as

$$S(V^h, W) = \sum_{e=1}^{n_{el}} \int_{\Omega_e(t)} (\mathcal{L} V^h) \, \tau \, (\mathcal{L} W) \, d\Omega_x \tag{68}$$

Consistency requires that $S$ vanishes when $W$ is solution of the Euler equations. Integrating by parts Eq. (67) and exploiting $\partial V^h / \partial t |_{\xi} = 0$ leads to

$$\frac{d}{dt} \int_{\Omega(t)} V^h W \, d\Omega_x - \int_{\Omega(t)} V^h_{,i} \mathcal{F}^c_i(W, \dot{x}) \, d\Omega_x + S(V^h, W) = 0 \tag{69}$$

Integrating the above equation between $t^n$ and $t^{n+1}$ yields

$$\int_{\Omega(t^{n+1})} V^h W \, d\Omega_x - \int_{\Omega(t^n)} V^h W \, d\Omega_x - \int_{t^n}^{t^{n+1}} \int_{\Omega(t)} V^h_{,i} \mathcal{F}^c_i(W, \dot{x}) \, d\Omega_x \, dt + \int_{t^n}^{t^{n+1}} S(V^h, W) \, dt = 0 \tag{70}$$

2.3.2. The Geometric Conservation Law

Substituting a constant field $W = W^*$ in Eq. (70) leads to

$$\int_{\Omega(t^{n+1})} V^h W^* \, d\Omega_x - \int_{\Omega(t^n)} V^h W^* \, d\Omega_x - \int_{t^n}^{t^{n+1}} \int_{\Omega(t)} V^h_{,i} \mathcal{F}^c_i(W^*, \dot{x}) \, d\Omega_x \, dt + \int_{t^n}^{t^{n+1}} S(V^h, W^*) \, dt = 0 \tag{71}$$

At this point, it is essential to assume that the consistency of $S$ is preserved in its discrete counterpart (at least for a uniform field), and therefore the last term in the above equation is identically zero. From Eq. (68) it can be observed that the least-square term identifies pointwise with zero, and hence the assumption is satisfied independently of the integration rule. One can also reasonably assume that the first and second terms of the above equation can be computed exactly, and that the evaluation of any term of the form

$$\int_{\Omega} V^h_{,i} \beta_i \, d\Omega_x = 0 \tag{72}$$

where $\beta_i$ are constants, can also be carried out exactly. Indeed, the latter condition is desirable not only for ALE computations, but also for flow computations using fixed meshes. Violating this condition will introduce artificial fluxes throughout the mesh. Therefore, this condition is the finite element form of the Surface Conservation Law introduced in [39]. For example, if the weighting functions are linear polynomials over each element, $V^h_{,i}$ is constant over each element and a single point integration rule will yield an exact integration formula, provided that the area/volume of the element is computed exactly.

Consequently, provided the SCL is satisfied, and for weighting functions that are zero on the boundary, it follows that

$$\int_{\Omega} V^h_{,i} \mathcal{F}_i(W^*) \, d\Omega_x = 0 \tag{73}$$

Hence, Eq. (71) can be rewritten as

$$\int_{\Omega(t^{n+1})} V^h W^* \, d\Omega_x - \int_{\Omega(t^n)} V^h W^* \, d\Omega_x - \int_{t^n}^{t^{n+1}} \int_{\Omega(t)} \left( - V^h_{,i} \dot{x}_i \right) W^* \, d\Omega_x \, dt = 0 \tag{74}$$

and can be simplified to

$$\int_{\Omega(t^{n+1})} V^h \, d\Omega_x - \int_{\Omega(t^n)} V^h \, d\Omega_x - \int_{t^n}^{t^{n+1}} \int_{\Omega(t)} \left( - V^h_{,i} \dot{x}_i \right) d\Omega_x \, dt = 0 \tag{75}$$

Eq. (75) establishes the geometric conservation law for the stabilized finite element method with an ALE formulation.

2.3.3. Implications of the GCL

In order to find the appropriate formula for integrating exactly the last term of the above GCL, we proceed as follows. First, we introduce the function

$$G(T) = \int_{t^n}^{T} \int_{\Omega(t)} - V^h_{,i} \dot{x}_i \, d\Omega_x \, dt \tag{76}$$

and note that this function can also be written

$$G(T) = \int_{\Omega(T)} V^h(\xi(x, T)) \, d\Omega_x - \int_{\Omega(t^n)} V^h(\xi(x, t^n)) \, d\Omega_x \tag{77}$$

From the differentiation of Eqs. (76,77) it follows that

$$\int_{\Omega(T)} - V^h_{,i} \dot{x}_i \, d\Omega_x = \frac{d}{dT} \int_{\Omega(T)} V^h(T) \, d\Omega_x \tag{78}$$

Hence, the appropriate formula for integrating exactly the last term in Eq. (75) and satisfying the GCL is the one which computes exactly $\frac{d}{dT} \int_{\Omega(T)} V^h(T) \, d\Omega_x$.


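2.3.4. The Two-Dimensional Case

Let $N_k$ be some arbitrary mapping functions between the current and reference configurations. We have

$$x_1 = \sum_{k=1}^{k_{max}} N_k(\xi) x_{k1}, \qquad x_2 = \sum_{k=1}^{k_{max}} N_k(\xi) x_{k2} \tag{79}$$

where the subscripts 1 and 2 designate the two different coordinates, and the subscripts $k$ refer to the nodal vertices of the element. The Jacobian $J$ of the above transformation is given by

$$J = \det\left( \frac{\partial x_i}{\partial \xi_j} \right) = \det\left( \frac{\partial N_k}{\partial \xi_j} x_{ki} \right) \tag{80}$$

where summation is assumed over repeated indices, and $x_{ki}$ are given by

$$x_{ki}(t) = \delta(t) x_{ki}^{n+1} + (1 - \delta(t)) x_{ki}^n \qquad \text{for } k = 1, \ldots, k_{max}; \quad i = 1, 2 \tag{81}$$

Here, $\delta(t)$ satisfies the conditions given in Eq. (45). This form shows that the matrix involved in the computation of $J$ is a linear function of $\delta$, and therefore $J$ is a quadratic function of $\delta$ that can be written as

$$J = J_0(\xi) + \delta J_1(\xi) + \delta^2 J_2(\xi) \tag{82}$$

The function $G$ can now be rewritten, up to a term that does not depend on $T$, as

$$G(T) = \sum_{e=1}^{n_{el}} \int_{\Omega_e(0)} \left( J_0(\xi) + \delta J_1(\xi) + \delta^2 J_2(\xi) \right) V^h(\xi) \, d\Omega_\xi \tag{83}$$

Therefore, the following conclusions can be made

• $G(T)$ is quadratic in $\delta(T)$, and since

$$G(t^{n+1}) = \int_{t^n}^{t^{n+1}} \dot{\delta}(T) \frac{dG}{d\delta}(\delta(T)) \, dT = \int_0^1 \frac{dG}{d\delta}(\delta) \, d\delta \tag{84}$$

$\frac{dG}{d\delta}(\delta(T))$ is linear in $\delta$ and hence, the GCL condition will be satisfied if the midpoint rule is used for the integration of the last term in Eq. (75).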
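
The finite element form of the GCL can be checked on a small patch of linear triangles. The sketch below is an illustration under stated assumptions, not the paper's implementation: the weighting function is taken as the P1 hat function of the central node of a five-triangle patch (so that it vanishes on the patch boundary), the node coordinates and displacements are invented, and `hat_integral` and `gcl_term` are hypothetical helper names. It evaluates the last term of Eq. (75), with the mesh velocity of Eq. (50), on a configuration at a fraction theta of the time-step and compares it with the exact change of the integral of the weighting function.

```python
import numpy as np

def tri_area(p):
    """Signed area of a triangle given as a (3, 2) array of vertices."""
    return 0.5 * ((p[1, 0] - p[0, 0]) * (p[2, 1] - p[0, 1])
                  - (p[2, 0] - p[0, 0]) * (p[1, 1] - p[0, 1]))

def hat_integral(nodes, tris):
    """Integral over the patch of the P1 hat function of the central node:
    each triangle contributes one third of its area."""
    return sum(tri_area(nodes[list(t)]) for t in tris) / 3.0

def gcl_term(nodes_old, nodes_new, tris, theta):
    """dt * sum_e int_e ( -grad(V) . xdot ) dOmega with xdot = (x_new - x_old)/dt,
    evaluated on the configuration (1 - theta) * old + theta * new."""
    nodes = (1.0 - theta) * nodes_old + theta * nodes_new
    d = nodes_new - nodes_old
    total = 0.0
    for (i, j, k) in tris:                 # i = central node, (i, j, k) counter-clockwise
        pj, pk = nodes[j], nodes[k]
        # gradient of the hat function times the element area (constant per element)
        grad_times_area = 0.5 * np.array([pj[1] - pk[1], pk[0] - pj[0]])
        mean_disp = (d[i] + d[j] + d[k]) / 3.0   # element average of a linear field
        total -= grad_times_area @ mean_disp
    return total

# Hypothetical patch: node 0 is the centre, nodes 1..5 form a counter-clockwise ring.
nodes_old = np.array([[0.1, 0.05], [1.0, 0.0], [0.4, 0.9],
                      [-0.7, 0.6], [-0.8, -0.5], [0.3, -1.0]])
disp = np.array([[0.03, -0.04], [0.05, -0.02], [0.10, 0.03],
                 [0.20, 0.10], [-0.05, 0.15], [-0.10, 0.0]])
nodes_new = nodes_old + disp
tris = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 5), (0, 5, 1)]

exact_change = hat_integral(nodes_new, tris) - hat_integral(nodes_old, tris)
term_mid = gcl_term(nodes_old, nodes_new, tris, 0.5)
term_old = gcl_term(nodes_old, nodes_new, tris, 0.0)

print("change of the integral of V:", exact_change)
print("midpoint-rule value        :", term_mid)
print("value on the t^n mesh      :", term_old)
```

The midpoint configuration matches the exact change to round-off, in line with the linearity of dG/d(delta) noted above, while the evaluation on the mesh at t^n does not.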
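2.3.5. The Three-Dimensional Case

Similarly to the two-dimensional case, the mapping between a reference and current element configuration can be written as

$$x_1 = \sum_{k=1}^{k_{max}} N_k(\xi) x_{k1}, \qquad x_2 = \sum_{k=1}^{k_{max}} N_k(\xi) x_{k2}, \qquad x_3 = \sum_{k=1}^{k_{max}} N_k(\xi) x_{k3} \tag{85}$$

and its Jacobian $J$ is given by

$$J = \det\left( \frac{\partial N_k}{\partial \xi_j} x_{ki} \right) \tag{86}$$

Following the same reasoning as in the two-dimensional case, the following conclusions can be made

• $G$ is cubic in $\delta(T)$, $\frac{dG}{d\delta}(\delta(T))$ is quadratic in $\delta$, and therefore the GCL condition will be satisfied if the two-point rule is used for the integration of the last term in Eq. (75).

2.3.6. Integration Formulae

As discussed above, the integrand of the last term of the geometric conservation law (75) can be linear or quadratic. For a linear integrand, the midpoint rule will perform an exact integration. For a quadratic integrand, the two-point rule must be employed. In all cases, Eq. (78) holds only if $\dot{x}$ is computed in a manner that is compatible with the deformation of $\Omega(t)$, that is, if it is obtained by differentiation of Eq. (81).

Recalling that we are interested in computing

$$\int_{t^n}^{t^{n+1}} \int_{\Omega(t)} - V^h_{,i} \dot{x}_i \, d\Omega_x \, dt \tag{87}$$

and making the change of variable suggested in Eq. (81), which implies $\dot{x}_i = \dot{\delta} (x_i^{n+1} - x_i^n)$, we obtain

$$\int_{t^n}^{t^{n+1}} \int_{\Omega(t)} - V^h_{,i} \dot{x}_i \, d\Omega_x \, dt = \int_0^1 \int_{\Omega(t)} - V^h_{,i} (x_i^{n+1} - x_i^n) \, d\Omega_x \, d\delta = \Delta t \int_0^1 \int_{\Omega(t)} - V^h_{,i} \frac{x_i^{n+1} - x_i^n}{\Delta t} \, d\Omega_x \, d\delta \tag{88}$$

This in turn implies that the mesh velocity $\dot{x}$ must be computed as follows

$$\dot{x} = \frac{x^{n+1} - x^n}{\Delta t} \tag{89}$$

In summary, the following formulae apply

• two-dimensional flow problems:

$$\int_{t^n}^{t^{n+1}} \int_{\Omega(t)} V^h_{,i} \mathcal{F}^c_i(W, \dot{x}) \, d\Omega_x \, dt = \Delta t \int_{\Omega(x^{n+\frac{1}{2}})} V^h_{,i} \mathcal{F}^c_i(W^k, \dot{x}^{n+\frac{1}{2}}) \, d\Omega_x$$

$$x^{n+\frac{1}{2}} = \frac{x^n + x^{n+1}}{2}, \qquad \dot{x}^{n+\frac{1}{2}} = \frac{x^{n+1} - x^n}{\Delta t} \tag{90}$$

where the superscript $k$ depends on the time discretization of the flow equation.

• three-dimensional flow problems:

$$\int_{t^n}^{t^{n+1}} \int_{\Omega(t)} V^h_{,i} \mathcal{F}^c_i(W, \dot{x}) \, d\Omega_x \, dt = \frac{\Delta t}{2} \left( \int_{\Omega(x^{m1})} V^h_{,i} \mathcal{F}^c_i(W^{k1}, \dot{x}^{n+\frac{1}{2}}) \, d\Omega_x + \int_{\Omega(x^{m2})} V^h_{,i} \mathcal{F}^c_i(W^{k2}, \dot{x}^{n+\frac{1}{2}}) \, d\Omega_x \right)$$

$$m1 = n + \frac{1}{2} + \frac{1}{2\sqrt{3}}; \qquad m2 = n + \frac{1}{2} - \frac{1}{2\sqrt{3}}; \qquad x^{n+\eta} = \eta x^{n+1} + (1 - \eta) x^n; \qquad \dot{x}^{n+\frac{1}{2}} = \frac{x^{n+1} - x^n}{\Delta t} \tag{91}$$

where the superscripts $k1$ and $k2$ depend on the
time discretization of the flow equation.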
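
The need for a two-point rule in three dimensions can be checked on a single moving tetrahedron. This sketch is an illustration only: the tetrahedron, the affine motion and the helper names are invented, the face ordering is chosen so that the facet normals point outward, and the two-point rule is taken to be the two-point Gauss rule on [0, 1]. It sums the facet fluxes (1/6)(dx_a + dx_b + dx_c) . (x_ac ^ x_bc) over the four facets and compares the midpoint and two-point estimates of the swept volume with the exact change in volume.

```python
import numpy as np

# Faces of the tetrahedron (vertices 0..3), ordered so that
# (x_a - x_c) x (x_b - x_c) points outward for a positively oriented element.
FACES = [(0, 2, 1), (0, 1, 3), (1, 2, 3), (0, 3, 2)]

def tet_volume(p):
    """Signed volume of a tetrahedron given as a (4, 3) array of vertices."""
    return np.dot(np.cross(p[1] - p[0], p[2] - p[0]), p[3] - p[0]) / 6.0

def facet_flux_sum(p_old, p_new, theta):
    """Sum over facets of (1/6) (dx_a + dx_b + dx_c) . (x_ac ^ x_bc), with the
    facet geometry taken at (1 - theta) * p_old + theta * p_new."""
    p = (1.0 - theta) * p_old + theta * p_new
    d = p_new - p_old
    total = 0.0
    for a, b, c in FACES:
        normal_times_2area = np.cross(p[a] - p[c], p[b] - p[c])
        total += (d[a] + d[b] + d[c]) @ normal_times_2area / 6.0
    return total

def swept_volume(p_old, p_new, rule):
    """Integrate the facet fluxes over the time-step with the chosen rule."""
    if rule == "midpoint":
        return facet_flux_sum(p_old, p_new, 0.5)
    g = 0.5 / np.sqrt(3.0)                 # two-point Gauss abscissae on [0, 1]
    return 0.5 * (facet_flux_sum(p_old, p_new, 0.5 - g)
                  + facet_flux_sum(p_old, p_new, 0.5 + g))

# Hypothetical element and motion (stretch, shear and translation).
p_old = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
M = np.array([[1.5, 0.2, 0.0], [0.0, 1.2, 0.1], [0.1, 0.0, 1.3]])
p_new = p_old @ M.T + np.array([0.1, -0.05, 0.02])

volume_change = tet_volume(p_new) - tet_volume(p_old)
swept_two_point = swept_volume(p_old, p_new, "two-point")
swept_midpoint = swept_volume(p_old, p_new, "midpoint")

print("volume change      :", volume_change)
print("two-point estimate :", swept_two_point)
print("midpoint estimate  :", swept_midpoint)
```

Because the integrand is quadratic in delta, the two-point rule recovers the volume change to round-off whereas the midpoint rule leaves a small residual, which is why the midpoint formula is sufficient only in two dimensions.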


2.4. Impact of the GCL on the Temporal
Solution of Aeroelastic Problems

The most remarkable implication of the GCL condition is the constraint it imposes on the mesh velocity computation, independently of the integration formula for the flow equations

$$\dot{x}^{n+\frac{1}{2}} = \frac{x^{n+1} - x^n}{\Delta t} \tag{92}$$

This formula is intuitive and has been “naturally” used by several investigators independently from any geometric conservation law (see, for example, [15]). However, when sophisticated time-integrators are used for the structure and/or the mesh equations, neither the computed mesh velocities $\dot{x}^{n+\frac{1}{2}}$ nor the computed structural velocities on the fluid/structure interface are guaranteed to obey $\dot{x}^{n+\frac{1}{2}} = (x^{n+1} - x^n) / \Delta t$. In that case, satisfying the GCL requires

• using the mesh velocity $\dot{x}^{n+\frac{1}{2}}$ computed by the time-integrator only for evaluating $x^{n+1}$.

• using the mesh velocity $\dot{x}^{n+\frac{1}{2}} = (x^{n+1} - x^n) / \Delta t$ in the evaluation of the fluid fluxes.

This means that it is not always possible to respect the continuity of both the displacement and velocity fields on the fluid/structure boundary as prescribed by Eqs. (23) without violating the GCL. For example, if the displacement continuity condition $x(t) = q(t)$ is enforced at the fluid/structure interface $\Gamma_{F/S}$, and that is usually the case, respecting the GCL implies computing a mesh velocity field on $\Gamma_{F/S}$ that is equal to

$$\dot{x}^{n+\frac{1}{2}} = \frac{x^{n+1} - x^n}{\Delta t} = \frac{q^{n+1} - q^n}{\Delta t} \quad \text{on } \Gamma_{F/S} \tag{93}$$

In that case, satisfying also the velocity continuity condition $\dot{x}(t) = \dot{q}(t)$ on $\Gamma_{F/S}$ requires that

$$\dot{q}^{n+\frac{1}{2}} = \frac{q^{n+1} - q^n}{\Delta t} \quad \text{on } \Gamma_{F/S} \tag{94}$$

which is not enforced by all structural time-integrators. Therefore, it is not always possible to satisfy the continuity between both the displacement and the velocity of the structure, and those of the fluid mesh at the fluid/structure interface, without violating the GCL.


Unfortunately, a discontinuity between the velocity of the structure and that of the fluid mesh at the fluid/structure interface can perturb the energy exchange between the fluid and the structure. However, it can be shown that when the implicit midpoint rule is used for advancing the structure and the displacement condition $x(t) = q(t)$ is enforced on $\Gamma_{F/S}$ using a staggered algorithm, both continuity equations (23) can be enforced without violating the GCL. The proof goes as follows.

Given some initial conditions $q^0$ and $\dot{q}^0$, suppose that the mesh motion is initialized such that the following holds on the fluid/structure interface

$$x^{-\frac{1}{2}} = q^0 - \frac{\Delta t}{2} \dot{q}^0 \quad \text{on } \Gamma_{F/S} \tag{95}$$

Also suppose that at each time-station $t^n$, the continuity of the velocity field is enforced on the fluid/structure boundary

$$\dot{x}^n = \dot{q}^n \quad \text{on } \Gamma_{F/S} \tag{96}$$

If the midpoint rule is used for time-integrating the structural equations of motion, and the dynamic fluid mesh is updated consistently with the GCL as in Eq. (92), it can be proved by induction that

$$x^{n-\frac{1}{2}} = q^n - \frac{\Delta t}{2} \dot{q}^n \quad \text{on } \Gamma_{F/S} \tag{97}$$

Indeed, the above relation holds at $n = 0$. Assuming it holds at $n$, it follows that

$$x^{n+\frac{1}{2}} = x^{n-\frac{1}{2}} + \Delta t \, \dot{x}^n = q^n - \frac{\Delta t}{2} \dot{q}^n + \Delta t \, \dot{q}^n = q^n + \frac{\Delta t}{2} \dot{q}^n \tag{98}$$

Since the midpoint rule algorithm applied to the structural equations implies

$$q^{n+1} - q^n = \frac{\Delta t}{2} \left( \dot{q}^n + \dot{q}^{n+1} \right) \tag{99}$$

it follows that

$$x^{n+\frac{1}{2}} = q^{n+1} - \frac{\Delta t}{2} \dot{q}^{n+1} \tag{100}$$

which completes the proof by induction of Eq. (97).

Now, a staggered algorithm for solving the coupled Eqs. (22) can be described as follows

1) using the mesh displacement $x^{n-\frac{1}{2}}$ and the mesh velocity $\dot{x}^n$ that matches the structural velocity $\dot{q}^n$ on $\Gamma_{F/S}$, update the mesh as follows

$$x^{n+\frac{1}{2}} = x^{n-\frac{1}{2}} + \Delta t \, \dot{x}^n \tag{101}$$

2) using $x^{n-\frac{1}{2}}$, $x^{n+\frac{1}{2}}$ and $\dot{x}^n$, update the fluid state vector $W^{n+\frac{1}{2}}$ in a manner that satisfies the GCL

3) using the pressure computed from $W^{n+\frac{1}{2}}$, compute $q^{n+1}$ and $\dot{q}^{n+1}$ using the midpoint rule

Defining $x^n$ as

$$x^n = \frac{x^{n-\frac{1}{2}} + x^{n+\frac{1}{2}}}{2} \tag{102}$$

and substituting Eq. (101) into Eq. (102) leads to

$$x^n = x^{n-\frac{1}{2}} + \frac{\Delta t}{2} \dot{x}^n \tag{103}$$

which in view of Eqs. (97,96) yields

$$x^n = q^n \quad \text{on } \Gamma_{F/S} \tag{104}$$

and demonstrates that, when the midpoint rule is used for time-integrating the structure and a proper staggered procedure is used for solving the coupled fluid/structure problem, the continuity of both the displacement and velocity fields can be enforced on $\Gamma_{F/S}$ without violating the GCL.

2.5.  Numerical  Example 


In order to highlight the impact of the GCL on coupled aeroelastic computations, we consider here the simulation of the two-dimensional transient aeroelastic response of a flexible panel in a transonic regime. The panel is represented by its cross section that is assumed to have a unit length and a uniform thickness and Young modulus, and to be clamped at both ends. This rectangular cross section is discretized into plane strain 4-node elements with perfect aspect ratios. The two-dimensional flow domain around the panel is discretized into triangles, and the Euler equations are used for this computation. The free stream Mach number is set to $M_\infty = 0.8$, and a slip condition is imposed at the fluid/structure boundary. Further details on the specifics of this simulation are deferred to Section 9.

Initially, a steady-state flow is computed around the panel at $M_\infty = 0.8$. Next, this flow is perturbed via an initial displacement of the panel that is proportional to its second fundamental mode, and the subsequent panel motion and flow evolution are computed using one of the staggered explicit/implicit fluid/structure procedures described in the following section. Two computed histories of the lift using the same time-step are reported: in Fig. 8 for the case where the GCL is violated by updating the mesh velocity field at the fluid/structure interface via a higher-order scheme than that given in Eq. (92), and in Fig. 9 for the case where the GCL is respected. Clearly, this example demonstrates the impact of the GCL on aeroelastic computations, as it shows that violating this law leads to undesirable spurious oscillations in the lift prediction.


Fig.  8.  Lift  history  when  the  GCL  is  violated 


Fig.  9.  Lift  history  when  the  GCL  is  obeyed 

3.  A  FAMILY  OF  STAGGERED 
SOLUTION  PROCEDURES  [23,44,45] 

In  Section  1,  we  have  shown  that  in  the  linear 
theory,  the  flutter  speed  of  an  aircraft  can  be 
obtained directly from the solution of an eigenvalue
problem. In the nonlinear theory, whether
an aircraft will flutter for a given set of
flight conditions is determined by computing
the solution of Eqs. (22) and establishing
numerically whether this solution grows
continuously in time. In other words,
a  linear  aeroelastic  dynamic  stability  problem 
can  be  solved  without  computing  explicitly  the 
response  of  the  structure,  but  a  nonlinear  aeroe¬ 
lastic  dynamic  stability  problem  is  typically 
solved  by  simulating  a  set  of  corresponding  non¬ 
linear  response  problems.  Hence,  transient  non¬ 
linear  aeroelastic  investigations  are  in  general 
computationally  intensive.  For  example,  estab¬ 
lishing  the  transonic  flutter  boundary  of  an  air¬ 
craft  for  a  given  set  of  aeroelastic  parameters 
requires  about  30  aeroelastic  response  analyses, 
which  clearly  demonstrates  the  need  for  a  fast 
capability  for  solving  Eqs.  (22).  Such  a  capabil¬ 
ity  requires  not  only  powerful  supercomputers, 
but  also  powerful  computational  methodologies 
and  algorithms. 

One  approach  for  solving  the  three-way 
coupled  aeroelastic  problem  described  in  Eqs.  (22) 
is  known  as  the  “monolithic  augmentation” 
approach where, as specific problems arise, a
large-scale single computer program (for example,
a finite element structural analysis code)
is expanded to house more interaction effects
(for example, fluid/structure interaction).
Such an approach poses several difficulties,
most of which are related to the fact
that  each  of  the  three  components  of  the  three- 
way  coupled  aeroelastic  problem  described  in 
Eqs.  (22)  has  different  mathematical  and  nu¬ 
merical  properties,  and  distinct  software  imple¬ 
mentation  requirements.  Some  of  these  diffi¬ 
culties  have  been  mentioned  in  Section  1,  oth¬ 
ers  are  summarized  in  [20].  In  our  opinion,  the 
monolithic augmentation approach is unattractive
because once it is implemented, it cannot
easily accommodate either new or improved
problem formulations or future advances
within any of the computational fluid
and/or structural dynamics disciplines.

Alternatively,  the  solution  of  Eqs.  (22) 
can  be  obtained  through  a  staggered  procedure 
in which separate fluid and structural analysis
programs (often called field analyzers [20])
execute and exchange data. Such an approach
is also known as partitioned analysis.
It  offers  several  appealing  features,  including 
the  ability  to  use  well  established  discretization 
and  solution  methods  within  each  discipline, 
simplification  of  software  development  efforts, 
reuse  of  existing  and  validated  code,  accom¬ 
modation  of  future  single  discipline  improve¬ 
ments,  and  preservation  of  software  modular¬ 
ity.  Traditionally,  nonlinear  transient  aeroelas¬ 
tic  problems  have  been  solved  via  the  simplest 
possible  staggered  procedure  where  the  sepa¬ 
rate  fluid  and  structural  analysis  programs  ex¬ 
ecute  in  a  strictly  sequential  fashion,  and  ex¬ 
change  strictly  interface-state  data  such  as  pres¬ 
sures  and  velocities  at  each  single  time-step 
(see, for example, [15,16,24-27]). The objective
of this section is to overview a broader family
of more powerful staggered solution procedures
that address some important issues related
to numerical stability, subcycling, accuracy
vs. speed trade-offs, implementation on
heterogeneous  computing  platforms,  and  inter¬ 
field  as  well  as  intra-field  parallel  processing. 

3.1.  Preliminaries 

Of  course,  the  global  performance  of  a  parti¬ 
tioned  analysis  for  solving  the  time-dependent 
Eqs.  (22)  depends  on  the  local  performances 
of  the  fluid  and  structural  field  analyzers.  But 
more  importantly,  the  global  performance  also 
depends  on  the  stability  and  accuracy  proper¬ 
ties  of  the  staggered  solution  procedure  itself. 
For a given prescribed accuracy, the more stable
a staggered algorithm is, the larger the allowable
coupled time-integration step, and therefore
the shorter the total solution time. Hence,
our primary goal is to construct partitioned
analysis procedures for Eqs. (22) with superior
stability properties.

REMARK  2:  The  reader  is  reminded  that  the 
stability  properties  of  a  staggered  solution  algo¬ 
rithm  depend,  among  other  things,  on  the  sta¬ 
bility  properties  of  the  field  analyzers.  However, 
it  is  also  well-known  that  using  an  uncondition¬ 
ally  stable  time  integration  algorithm  in  each 
field  analyzer  does  not  guarantee  the  uncondi¬ 
tional  stability  of  the  overall  staggered  solution 
algorithm. 

Because  the  aeroelastic  response  of  a  struc¬ 
ture  is  often  dominated  by  low  frequency  dy¬ 
namics,  we  consider  only  implicit  schemes  for 
time-integrating  the  structural  displacement 
field.  However,  we  consider  both  explicit  and 
implicit  time-integrators  for  advancing  the  fluid 
field,  as  both  approaches  are  popular  in  compu¬ 
tational  fluid  dynamics.  On  the  other  hand,  we 
also note that time-accurate implicit and
unstructured flow solvers seem to be less available
than their explicit counterparts. In the
sequel, we refer to a partitioned analysis procedure
as an explicit/implicit one if an explicit
time-accurate  flow  solver  is  employed,  and  as  an 
implicit/implicit  one  if  an  implicit  flow  solver 
is  used.  In  the  implicit/implicit  case,  our  goal 
is  to  devise  an  unconditionally  stable  staggered 
algorithm,  or  at  least  a  partitioned  procedure 
that  allows  a  relatively  large  time  step.  In  the 
explicit/implicit  case,  our  objective  is  to  design 
a  staggered  solution  algorithm  whose  stability 
limit  is  not  worse  than  that  of  the  underly¬ 
ing  explicit  flow  solver.  These  are  not  trivial 
tasks  because  coupling  effects  can  restrict  the 
stability  limits  of  the  independent  field  time- 
integrators. 

Next,  we  make  the  following  observations 

• linear and nonlinear transient fluid/structure
interaction problems have one particularity:
they possess a wide variety of self-excited
vibrations and instabilities. We
have already mentioned the flutter problem.
Another example of a dynamic instability
is that of the vibrations due to
von Karman vortices [43]. If the frequency
of the structure loading caused by
the vortices is close or equal to the natural
frequency of the body, then a resonance
effect is present and large amplitudes
of vibration result. Therefore, when
it  comes  to  analyzing  the  numerical  sta¬ 
bility  of  a  proposed  staggered  algorithm 
for  time-integrating  fluid/structure  inter¬ 
action  problems,  it  is  essential  to  consider 
the  case  where  the  coupled  system  is  physi¬ 
cally  stable  —  that  is,  when  Eqs.  (22)  have 
a  solution  that  does  not  grow  indefinitely 
in  time. 

• when the structure undergoes small displacements,
the fluid mesh can be frozen
and “transpiration” fluxes can be introduced
at the fluid side of the fluid/structure
boundary to account for the motion of the
structure. In that case, the nonlinear transient
aeroelastic problem simplifies from a
three- to a two-field coupled problem.

•  most  fluid/structure  instability  problems 
can  be  analyzed  by  investigating  the  re¬ 
sponse  of  the  coupled  system  to  a  pertur¬ 
bation  around  a  steady  state.  If  the  re¬ 
sponse  is  an  amplification  of  the  initial  per¬ 
turbation,  it  is  an  indication  that  the  sys¬ 
tem  is  unstable.  If  it  is  a  dissipation  of  the 
initial  perturbation,  it  means  that  the  sys¬ 
tem  is  stable.  This  suggests  that  aeroelas¬ 
tic  stability  or  instability  problems  can  be 
investigated  by  linearizing  the  flow  around 
an equilibrium position $W_0$, and analyzing
the  response  of  the  fluid/structure  system 
to  a  perturbation. 

Based  on  the  above  observations,  the  au¬ 
thors  of  reference  [23]  have  constructed  a  sim¬ 
plified  but  relevant  aeroelastic  “test”  problem 
where  the  coupled  fluid/structure  system  is  al¬ 
ways  physically  stable.  They  have  also  pre¬ 
sented  a  mathematical  framework  for  analyzing 
the  accuracy  and  stability  properties  of  stag¬ 
gered  procedures  applied  to  the  solution  of  their 
test  problem.  Subsequently,  this  test  problem 
was  also  shown  to  be  a  good  model  problem 
for  the  complex  nonlinear  aeroelastic  systems 
that  we  are  interested  in  solving  [23,18,19].  In 
the  test  problem,  the  structure  is  assumed  to 
remain  in  the  linear  regime,  and  the  flow  is  lin¬ 
earized  around  an  equilibrium  position  of  the 
fluid state vector denoted here by $W_0$. The
semi-discrete  equations  governing  this  coupled 
aeroelastic  model  problem  are  given  by  (see  [23] 
for  details) 

$$
\begin{pmatrix} \delta\dot{W} \\ \dot{Q} \end{pmatrix}
=
\begin{pmatrix} A^* & B \\ C & D^* \end{pmatrix}
\begin{pmatrix} \delta W \\ Q \end{pmatrix}
\qquad (105)
$$

where $\delta W$ is the perturbed fluid state vector,
$Q = \begin{pmatrix} q \\ \dot{q} \end{pmatrix}$ is the structure state vector, $A^*$
results from the spatial discretization of the flow
equations, $B$ is the matrix induced by the transpiration
fluxes at the fluid/structure boundary
$\Gamma_{F/S}$, $C$ is the matrix that transforms the fluid
pressure on $\Gamma_{F/S}$ into prescribed structural
forces, and
$D^* = \begin{pmatrix} 0 & I \\ -M^{-1}K & -M^{-1}D \end{pmatrix}$,
where as before, $M$, $D$, and $K$ are the structural
mass, damping, and stiffness matrices.


In  [19],  the  aeroelastic  model  problem  de¬ 
scribed  in  Eqs.  (105)  has  been  extended  to  in¬ 
clude  the  mesh  motion  of  the  fluid  grid,  and 
therefore  to  truly  represent  the  three-way  cou¬ 
pled  aeroelastic  problem  governed  by  Eqs.  (22). 
More  importantly,  reference  [19]  discusses  a 
methodology  for  considering  a  staggered  solu¬ 
tion  procedure  that  was  designed  for  solving 
the  three-field  equivalent  of  the  model  prob¬ 
lem  (105),  and  extending  it  to  the  case  of  non¬ 
linear  transient  aeroelastic  problems  such  as 
those  governed  by  Eqs.  (22). 


In  this  section,  we  overview  a  family  of  par¬ 
titioned  analysis  procedures  for  solving  the  non¬ 
linear  transient  coupled  Eqs.  (22).  These  algo¬ 
rithms  are  based  on  the  mathematical  results 
established  in  [23,19],  and  have  recently  been 
described  in  [44,45].  Rather  than  discussing 
mathematical  proofs  and  details  that  can  be 
found  in  [23,19],  we  emphasize  important  com¬ 
putational  and  implementational  issues  per¬ 
taining  to  accuracy,  stability,  distributed  com¬ 
puting,  I/O  transfers,  subcycling,  and  parallel 
processing. 


In  order  not  to  obscure  the  following  discus¬ 
sion  by  the  complex  notation  needed  for  three- 
dimensional  viscous  flows,  we  focus  here,  with¬ 
out  any  loss  of  generality,  on  the  case  of  two- 
dimensional  Euler  flows  discretized  by  the  finite 
volume  method.  For  three-dimensional  invis- 
cid  flows,  Eqs.  (58)  should  be  used  instead  of 
Eqs.  (51).  For  finite  element  and/or  space-time 
discretizations,  Eqs.  (51)  should  be  replaced  by 
the  appropriate  equations  derived  in  Section  2. 

From  the  results  established  in  Section  2,  it 
follows  that  the  semi-discrete  equations  govern¬ 
ing  the  three-way  coupled  aeroelastic  problem 
can  be  written  in  that  case  as 


(106) 


where  the  superscript  k  depends  on  the  time 
discretization  of  the  fluid  flow  equations. 

In  many  aeroelastic  investigations  such  as 
wing  flutter  problems,  first  a  steady  flow  is  com¬ 
puted  around  a  structure  in  equilibrium.  Next, 
the  structure  is  perturbed  via  an  initial  dis¬ 
placement  and/or  velocity  and  the  aeroelastic 
response  of  the  coupled  fluid/structure  system 
is  analyzed.  This  suggests  that  a  natural  se¬ 
quencing  for  the  staggered  time-integration  of 
Eqs.  (106)  is 

1. perturb the structure via some initial conditions.

2. update the fluid grid to conform to the new
structural boundary.

3. advance the flow with the new boundary
conditions.

4. advance the structure with the new pressure
load.

5. repeat from step 2 until the objective of
the simulation is reached.
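
The sequencing above can be sketched as a short driver loop. The Python fragment below is only an illustration under stated assumptions: the five callables (`perturb_structure`, `update_grid`, `advance_flow`, `advance_structure`, `done`) are hypothetical placeholders for the field analyzers and the stopping test, and the scalar toy model in the usage lines is invented so that the loop can be executed.

```python
def run_staggered(perturb_structure, update_grid, advance_flow,
                  advance_structure, done, max_cycles=10_000):
    """Generic staggered driver; all five callables are hypothetical stand-ins."""
    q = perturb_structure()              # step 1: initial structural perturbation
    cycles = 0
    while cycles < max_cycles:
        x = update_grid(q)               # step 2: grid conforms to the structure
        p = advance_flow(x)              # step 3: flow sees the new boundary
        q = advance_structure(q, p)      # step 4: structure sees the new load
        cycles += 1
        if done(cycles, q):              # step 5: repeat from step 2 or stop
            break
    return q, cycles


# Toy usage (assumed scalar model, not an aeroelastic solver):
# the "pressure" opposes the displacement, so q decays by 0.95 per cycle.
q_final, n_cycles = run_staggered(
    perturb_structure=lambda: 1.0,
    update_grid=lambda q: q,
    advance_flow=lambda x: -0.5 * x,
    advance_structure=lambda q, p: q + 0.1 * p,
    done=lambda cycles, q: cycles >= 10,
)
```

Passing the field analyzers as callables mirrors the modularity argument made for partitioned analysis: either kernel can be replaced without touching the driver.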


An  important  feature  of  partitioned  solu¬ 
tion  procedures  is  that  they  allow  using  exist¬ 
ing  single  discipline  software  modules.  In  our 
work,  we  have  been  particularly  interested  in  re¬ 
using  the  massively  parallel  explicit  flow  solver 
described  in  [46-49]  for  two-dimensional  prob¬ 
lems,  and  a  variant  for  three-dimensional  appli¬ 
cations.  Therefore,  we  consider  here  the  case 
where  the  semi-discrete  fluid  equations  are  in¬ 
tegrated  with  a  3-step  variant  of  the  explicit 
Runge-Kutta  algorithm.  Of  course,  other  ex¬ 
plicit  time-integrators  can  be  equally  employed. 
On  the  other  hand,  the  aeroelastic  response  of 
a  structure  is  often  dominated  by  low  frequency 
dynamics.  Hence,  the  structural  equations  are 
most efficiently solved by an implicit time-integration
scheme. For example, we choose to
time-integrate the structural motion with the
implicit  midpoint  rule  because  it  allows  enforc¬ 
ing  both  continuity  Eqs.  (23)  while  still  respect¬ 
ing  the  GCL  (see  Section  2).  Consequently, 
we  propose  the  following  explicit/implicit  so¬ 
lution  algorithm  for  solving  the  three-field  cou¬ 
pled  problem  (106). 


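Given a steady flow and initial structural conditions

1. Update the dynamic fluid grid

Solve $\tilde{M}\ddot{x}^{n+1} + \tilde{D}\dot{x}^{n+1} + \tilde{K}x^{n+1} = K_c\, q^n$

$\dot{x}^{n+\frac{1}{2}} = \dfrac{x^{n+1} - x^n}{\Delta t}$

2. Advance the fluid system using RK3

$W_i^{n+1^{(0)}} = W_i^n$

$W_i^{n+1^{(k)}} = \dfrac{A_i(x^n)}{A_i(x^{n+1})}\, W_i^{n+1^{(0)}} - \dfrac{1}{4-k}\, \dfrac{\Delta t}{A_i(x^{n+1})}\, F_i^c\big(W^{n+1^{(k-1)}}, x^{n+1}, \dot{x}^{n+\frac{1}{2}}\big), \quad k = 1, 2, 3$

$W_i^{n+1} = W_i^{n+1^{(3)}}$

3. Advance the structure using the midpoint rule

$M\ddot{q}^{n+1} + f^{int}(q^{n+1}) = f^{ext}(W^{n+1})$

$\dot{q}^{n+1} = \dot{q}^n + \dfrac{\Delta t}{2}\,(\ddot{q}^n + \ddot{q}^{n+1})$

$q^{n+1} = q^n + \dfrac{\Delta t}{2}\,(\dot{q}^n + \dot{q}^{n+1})$

(107)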


In the sequel, we refer to the above explicit/implicit
staggered solution procedure as
ALG0. It is graphically depicted in Fig. 10.
Extensive numerical simulations using this algorithm
have shown that its stability limit is
governed by the critical time-step of the explicit
fluid solver, and therefore is not worse than that
of the underlying fluid explicit time-integrator.

The 3-step Runge-Kutta algorithm is third-order
accurate for linear problems and second-order
accurate for nonlinear ones. The midpoint
rule is second-order accurate. A simple
Taylor expansion shows that the partitioned
analysis procedure ALG0 is first-order accurate
when applied to the linearized Eqs. (105).
When applied to Eqs. (106), its accuracy depends
on the solution scheme selected for solving
the mesh equations. As long as the time-integrator
applied to the last of Eqs. (106) is
consistent, ALG0 is guaranteed to be at least
first-order accurate.
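
The accuracy statements made above for the two field integrators can be checked on scalar problems. The following Python sketch is a simplified illustration, not the flow or structural solver itself: it assumes a fixed mesh (so the cell-area ratio plays no role), stage weights of $\Delta t/(4-k)$ for $k = 1, 2, 3$ in the Runge-Kutta variant, and a single-degree-of-freedom linear structure. The function names `rk3_step` and `midpoint_step` are hypothetical.

```python
def rk3_step(rhs, w, dt):
    """Three-stage explicit Runge-Kutta step with stage weights dt/(4-k)."""
    w0 = w
    wk = w0
    for k in (1, 2, 3):
        # Every stage restarts from w0 and uses the previous stage value.
        wk = w0 + dt / (4 - k) * rhs(wk)
    return wk


def midpoint_step(m, d, k, q, v, f, dt):
    """Implicit midpoint step for the scalar equation m*q'' + d*q' + k*q = f."""
    # Solve the mid-step equilibrium for vm = (v_n + v_{n+1}) / 2.
    vm = (f - k * q + 2.0 * m * v / dt) / (2.0 * m / dt + d + 0.5 * k * dt)
    return q + dt * vm, 2.0 * vm - v


# Linear test problem w' = lam * w: the amplification factor of rk3_step
# matches the Taylor series of exp(z) through the z**3 term.
lam, dt = -2.0, 0.1
z = lam * dt
w1 = rk3_step(lambda w: lam * w, 1.0, dt)
taylor3 = 1.0 + z + z ** 2 / 2.0 + z ** 3 / 6.0

# Undamped oscillator: the midpoint rule conserves the discrete energy.
q, v = 1.0, 0.0
e0 = 0.5 * 1.0 * v ** 2 + 0.5 * 4.0 * q ** 2
for _ in range(100):
    q, v = midpoint_step(1.0, 0.0, 4.0, q, v, 0.0, 0.05)
e1 = 0.5 * 1.0 * v ** 2 + 0.5 * 4.0 * q ** 2
```

The first check reproduces the third-order behavior for linear problems; the second shows why the midpoint rule is attractive for low frequency structural dynamics, since it introduces no numerical damping in the undamped linear case.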


Fig. 10. ALG0: the basic staggered algorithm


3.2.2. ALG1: subcycling

The fluid and structure fields often have different
time scales. For problems in aeroelasticity,
the fluid flow usually requires a finer temporal
resolution than the structural vibration.
Therefore, if ALG0 is used to solve Eqs. (106),
the coupling time-step $\Delta t_c$ will typically be dictated
by the stability time-step of the fluid system
$\Delta t_F$, and not by the time-step $\Delta t_S > \Delta t_F$ that
meets the accuracy requirements of the structural
field.

Using the same time-step $\Delta t$ in both fluid
and structure computational kernels presents
only minor implementational advantages. On
the other hand, subcycling the fluid computations
with a factor $n_{S/F} = \Delta t_S / \Delta t_F$ can offer
substantial computational advantages, including


•  savings  in  the  overall  simulation  CPU 
time,  because  in  that  case  the  structural 
field  will  be  advanced  fewer  times. 

•  savings  in  I/O  transfers  and/or  communi¬ 
cation  costs  when  computing  on  a  hetero¬ 
geneous  platform,  because  in  that  case  the 
fluid  and  structure  kernels  will  exchange 
information  fewer  times. 

However, the computational advantages
highlighted above are effective only if subcycling
does not restrict the stability region of the
staggered algorithm to values of the coupling
time-step $\Delta t_c$ that are small enough to offset
these advantages. In [23], it is shown that for
the linearized problem (105), the straightforward
conventional subcycling procedure (that
is, the scheme where at the end of each $n_{S/F}$
fluid subcycles only the interface pressure computed
during the last fluid subcycle is transmitted
to the structure) lowers the stability limit
of ALG0 to a value that is less than the critical
time-step of the fluid explicit time-integrator.

On the other hand, it is also shown in [23] that
when solving Eqs. (105), the stability limit of
ALG0 can be preserved if

• the deformation of the fluid mesh between
$t^n$ and $t^{n+1}$ is evenly distributed among
the $n_{S/F}$ subcycles.

• at the end of each $n_{S/F}$ fluid subcycles,
the average of the interface pressure field
$p_{F/S}$ computed during the subcycles between
$t^n$ and $t^{n+1}$ is transmitted to the
structure rather than the last computed
pressure.
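
The two conditions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the fluid-subcycled procedure itself: `advance_fluid` is a hypothetical callable standing in for one explicit fluid subcycle, the interface position is a scalar, and the toy relaxation model in the usage lines is invented so that the fragment runs.

```python
def subcycled_fluid_advance(advance_fluid, fluid_state, x_old, x_new,
                            dt_s, n_sub):
    """Advance the fluid over one structural step with n_sub subcycles."""
    dt_f = dt_s / n_sub
    pressures = []
    for k in range(1, n_sub + 1):
        # Condition 1: spread the interface motion evenly over the subcycles.
        x_k = x_old + (k / n_sub) * (x_new - x_old)
        fluid_state, p_k = advance_fluid(fluid_state, x_k, dt_f)
        pressures.append(p_k)
    # Condition 2: send the averaged pressure, not the last one.
    p_avg = sum(pressures) / n_sub
    return fluid_state, p_avg, pressures


# Toy usage (assumed model): the fluid state relaxes toward -x_iface
# and doubles as the interface pressure.
visited = []

def toy_fluid(state, x_iface, dt_f):
    visited.append(x_iface)
    state = state + dt_f * (-x_iface - state)
    return state, state

state, p_avg, p_hist = subcycled_fluid_advance(toy_fluid, 0.0, 0.0, 1.0, 0.4, 4)
```

In this toy run the averaged pressure differs noticeably from the last subcycle pressure, which is the quantity the conventional procedure would have transmitted.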

Hence, we propose the following explicit/implicit
fluid-subcycled partitioned procedure for solving
Eqs. (106).

(108)


In the sequel, we refer to the explicit/implicit
fluid-subcycled staggered solution procedure
presented here as ALG1. It is graphically depicted
in Fig. 11. Extensive numerical experiments
have shown that for medium values
of $n_{S/F}$, the stability limit of ALG1 is governed
by the critical time-step of the explicit
flow solver. However, experience has also shown
that there exists a maximum subcycling factor
beyond which ALG1 becomes numerically unstable.

From the theory developed in [23] for the
linearized Eqs. (105), it follows that ALG1 is
first-order accurate, and that as one would have
expected, subcycling amplifies the fluid errors
by the factor $n_{S/F}$.


Fig. 11. ALG1: subcycling


ALG0 and ALG1 are inherently sequential. In
both partitioned analysis procedures, the fluid
system must be updated before the structural
system can be advanced. Of course, ALG0
and ALG1 allow intra-field parallelism (parallel
computations within each discipline), but they
inhibit inter-field parallelism. Advancing the
fluid and structural systems simultaneously is
appealing because it can reduce the total simulation
time.

A simple variant ALG2 of ALG1 (or of
ALG0 if subcycling is not desired) that allows
inter-field parallel processing is given next.
Clearly, the fluid and structure kernels can run
in parallel during the time-interval $[t_n, t_{n+n_{S/F}}]$.
Inter-field communication or I/O transfer is
needed only at the beginning of each time-interval.
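
The inter-field parallelism described above can be illustrated with two concurrent tasks that both start from the data exchanged at the beginning of the time-interval. The Python sketch below is not the ALG2 listing; it only assumes that the fluid kernel receives the structural displacement known at $t_n$ and the structure kernel receives the pressure known at $t_n$. The names `parallel_coupled_cycle`, `toy_fluid` and `toy_structure` are hypothetical, and Python threads are used purely to show the control flow, not to obtain real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_coupled_cycle(advance_fluid, advance_structure,
                           fluid_state, struct_state, p_n, q_n, dt_s, n_sub):
    """Advance both fields over one coupled interval from data known at t_n."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both kernels are launched together; neither waits for the other.
        fut_f = pool.submit(advance_fluid, fluid_state, q_n, dt_s, n_sub)
        fut_s = pool.submit(advance_structure, struct_state, p_n, dt_s)
        return fut_f.result(), fut_s.result()


def toy_fluid(w, q_n, dt_s, n_sub):
    """Assumed scalar fluid model, subcycled n_sub times."""
    dt_f = dt_s / n_sub
    for _ in range(n_sub):
        w = w + dt_f * (-q_n - w)
    return w


def toy_structure(state, p_n, dt_s):
    """Assumed scalar structure model advanced in one shot."""
    q, v = state
    v = v + dt_s * (p_n - q)
    q = q + dt_s * v
    return (q, v)


w_new, s_new = parallel_coupled_cycle(
    toy_fluid, toy_structure, 0.0, (1.0, 0.0), 0.0, 1.0, 0.4, 4)
```

Because neither kernel sees data newer than $t_n$ from the other field, the result is the same as running the two kernels one after the other with the same inputs; only the elapsed time changes.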

The  basic  steps  of  ALG2  are  graphically 
depicted  in  Fig.  12.  The  theory  developed 
in  [23]  shows  that  for  the  linearized  Eqs.  (105), 
ALG2  is  first-order  accurate,  and  parallelism  in 
ALG2  is  achieved  at  the  expense  of  amplified 
errors  in  the  fluid  and  structure  responses. 


Fig.  12.  ALG2:  subcycling 
and  inter-field  parallelism 

In  order  to  improve  the  accuracy  of  the 
basic  parallel  time-integrator  ALG2,  we  pro¬ 
pose  to  exchange  correction  type  information 
between the fluid and structure kernels at half-step
in the following specific manner (ALG3).

(110)

Algorithm ALG3 is illustrated in Fig. 13.
The first half of the computations is identical to
that of ALG2, except that the fluid system is
subcycled only up to $t^{n+\frac{n_{S/F}}{2}}$, while the structure
is advanced in one shot up to $t^{n+n_{S/F}}$.
At $t^{n+\frac{n_{S/F}}{2}}$, the fluid and structure kernels
exchange pressure, displacement and velocity
information. In the second half of the computations,
the fluid system is subcycled from
$t^{n+\frac{n_{S/F}}{2}}$ to $t^{n+n_{S/F}}$ using the new structural
information, and the structural behavior is re-computed
in parallel using the newly received
pressure distribution. Note that the first evaluation
of the structural state vector can be interpreted
as a prediction step, and the second
as a correction step.

It  can  be  shown  that  when  applied  to  the 
linearized  Eqs.  (105),  ALG3  is  first-order  ac¬ 
curate  and  reduces  the  errors  of  ALG2  by  the 
factor $n_{S/F}$, at the expense of one additional
communication  step  or  I/O  transfer  during  each 
coupled  cycle  (see  [23]  for  a  detailed  error  anal¬ 
ysis). 


Fig.  13.  ALG3:  subcycling,  inter-field 
parallelism  and  improved  accuracy 


3.3. Implicit/implicit staggered algorithms

Clearly, the partitioned analysis procedures
ALG0, ALG1, ALG2 and ALG3 can be equally
employed with an implicit flow solver. However,
it is shown in [23] that in order for these partitioned
procedures to be unconditionally stable,
not only must an unconditionally stable implicit flow
solver be used, but an interface coupling
operator must also be exchanged between the
structure and fluid field analyzers. For further
details on this topic, we refer the reader to [23].

3.4. Implementation of subcycling

We have pointed out in Section 3.2.2 that when
subcycling is desired, the deformation of the
fluid mesh between $t^n$ and $t^{n+1}$ should not be
entirely applied during the first fluid subcycle,
but evenly distributed across all subcycling
stages. There are many ways this can be accomplished,
including the following one.

At the beginning of each coupled time-step, the
fluid code has access to the component of the
structural state vector $(q^n, \dot{q}^n)$ that relates
to the degrees of freedom located at the
fluid/structure interface. The objective of any
mesh updating strategy is to exploit this information
and compute a fluid mesh displacement
$x^{n+1}$ that satisfies the continuity Eqs. (23)

$$x^{n+1} = q^n ; \quad \dot{x}^{n+1} = \dot{q}^n \quad \text{on } \Gamma_{F/S} \qquad (111)$$

We note that the difference in the superscripts
between the left and right hand sides
of Eqs. (111) is due to the staggered nature
of the solution scheme, and that the second of
Eqs. (111) should be enforced only if it does
not violate the GCL. Using Eqs. (111) as prescribed
boundary values, the pseudo-structural
equations of motion of the dynamic fluid grid
can be solved to obtain an updated fluid mesh
displacement $x^{n+1}$. The details of this particular
computation are discussed in Section 7.
Then,  at  every  subcycling  stage,  a  new  set  of 
prescribed grid boundary displacements can be
generated for computing the subcycled mesh
position as follows


(112)


where  is  an  interpolation  scheme  of  order 
d  =  0,  1,  2.  More  specifically,  is  defined  by 


In summary, at each subcycle a new set
of prescribed boundary displacements is computed
for the dynamic fluid grid, and the equations
of motion of the corresponding pseudo-structural
system are solved in order to update
the positions of the remaining grid points. In
practice, we have found that d = 1 is the best
choice for the interpolation scheme. Among
other things, this choice does not require transmitting
any structural velocity information to
the fluid computational kernel. In all cases, the
fluid mesh velocity $\dot{x}$ must be computed via Eq.
(50) in order to satisfy the GCL.


4. THE FLOW SOLVER [46-49]


So far, no restriction has been imposed on the
nonlinear flow solver technology, except for the
requirement of satisfying the GCL. Hence, flow
solvers based on an ALE finite volume/element
discretization or a space-time finite element formulation
can be equally employed within the
computational framework presented in this paper
for solving nonlinear transient aeroelastic
problems. The GCLs for all of these approximation
methods have been presented in Section 2.


In  our  case,  we  have  opted  for  a  mixed 
finite element/volume ALE formulation based
on  unstructured  triangular  meshes  in  two- 
dimensional  problems,  and  unstructured  tetra- 
hedra  in  three-dimensional  ones.  This  ap¬ 
proach  combines  a  Galerkin  centered  approxi¬ 
mation  for  the  viscous  terms,  and  a  Roe  upwind 
scheme  [50]  for  the  computation  of  the  convec¬ 
tive  fluxes.  Higher  order  accuracy  is  achieved 
through  the  use  of  a  piecewise  linear  interpo¬ 
lation  method  that  follows  the  principle  of  the 
MUSCL  (Monotonic  Upwind  Scheme  for  Con¬ 
servative  Laws)  procedure  [51-53].  The  corre¬ 
sponding  ALE  flow  solvers  are  the  result  of  a 
collaboration  with  the  Projet  Sinus  at  INRIA 
Sophia-Antipolis and are overviewed below.


4.1. Spatial discretization


The  conservative  form  of  the  equations  describ¬ 
ing  viscous  flows  can  be  written  in  ALE  form 
as 


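$$\frac{\partial (JW)}{\partial t}\Big|_{\xi} + J\, \nabla_x \cdot \mathcal{F}^c(W, \dot{x}) = \frac{J}{Re}\, \nabla_x \cdot \mathcal{R}(W)$$

$$\mathcal{F}^c(W, \dot{x}) = \mathcal{F}(W) - \dot{x}\, W \qquad (114)$$

where $\mathcal{R}$ denotes the diffusive fluxes, $Re$ is the
Reynolds number, and as for the case of Euler
flows and Eqs. (25), $J = \det(\partial x / \partial \xi)$ is the
Jacobian of the frame transformation $\xi \to x$, $W$
denotes the fluid conservative variables, $\mathcal{F}^c$ denotes
the convective ALE fluxes, and $\dot{x} = \frac{\partial x}{\partial t}\big|_{\xi}$
is the ALE grid velocity that may be different
from the fluid velocity and from zero.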

The boundary $\Gamma(t)$ of the flow domain
is partitioned into a wall boundary $\Gamma_w(t)$
corresponding to the fluid/structure interface
boundary $\Gamma_{F/S}$, and an infinity boundary $\Gamma_\infty(t)$

$$\Gamma(t) = \Gamma_w(t) \cup \Gamma_\infty(t) \qquad (115)$$

Let $U_w$ and $T_w$ denote the wall velocity and
temperature. On the wall boundary $\Gamma_w(t)$, a
no-slip condition and a Dirichlet temperature
condition are imposed

$$U = U_w ; \quad T = T_w \qquad (116)$$

and  no  boundary  condition  is  specified  for  the 
density.  Hence,  the  total  energy  per  unit  of 
volume  and  the  pressure  on  the  wall  are  given 

by 


$$p = (\gamma - 1)\, \rho\, C_v T_w ; \quad E = \rho\, C_v T_w + \frac{1}{2}\, \rho\, \| U_w \|^2 \qquad (117)$$
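
As a quick numerical illustration of the wall relations (117), the Python sketch below evaluates the wall pressure and the total energy per unit of volume from the density, the wall temperature and the wall velocity. The function name `wall_pressure_energy` and the default values of `gamma` and `c_v` are assumptions made for this example only.

```python
def wall_pressure_energy(rho, t_wall, u_wall, gamma=1.4, c_v=1.0):
    """Wall pressure and total energy per unit volume from rho, T_w and U_w."""
    internal = rho * c_v * t_wall                  # rho * Cv * Tw
    p = (gamma - 1.0) * internal                   # perfect gas law at the wall
    kinetic = 0.5 * rho * sum(c * c for c in u_wall)
    return p, internal + kinetic


# Assumed sample state: rho = 2, Tw = 3, Uw = (1, 2) in consistent units.
p_wall, e_wall = wall_pressure_energy(2.0, 3.0, (1.0, 2.0))
```

Since the density is not prescribed at the wall, it is an input here: in a flow solver it would come from the current fluid state at the wall vertex.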

For  external  flows  around  aircraft  structures, 
the  viscous  effects  are  assumed  to  be  negligible 
at  infinity,  and  therefore  a  uniform  free-stream 
state vector $W_\infty$ is imposed on $\Gamma_\infty(t)$.

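Following the procedure described in detail
in Section 2.1, Eqs. (114) can be transformed
into

$$\frac{d}{dt} \int_{C_i(t)} W \, d\Omega_x + \int_{C_i(t)} \nabla_x \cdot \mathcal{F}^c(W, \dot{x}) \, d\Omega_x = \frac{1}{Re} \int_{C_i(t)} \nabla_x \cdot \mathcal{R}(W) \, d\Omega_x \qquad (118)$$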


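Integrating Eq. (118) by parts leads to

$$\frac{d}{dt} \int_{C_i(t)} W \, d\Omega_x + \sum_{j} \int_{\partial C_{ij}(x)} \mathcal{F}^c(W, \dot{x}) \cdot \nu_i \, d\sigma + \int_{\partial C_i(t) \cap \Gamma(t)} \mathcal{F}^c(\bar{W}, \dot{x}) \cdot \nu_i \, d\sigma = -\frac{1}{Re} \sum_{\Delta,\, S_i \in \Delta} \int_{\Delta} \mathcal{R}(W) \cdot \nabla \phi_i^{\Delta} \, d\Omega_x \qquad (119)$$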


where $\phi_i^{\Delta}$ is the finite element shape function
at vertex $S_i$ in element $\Delta$ (a triangle in two-dimensional
problems, a tetrahedron in three-dimensional
ones), and $\bar{W}$ is the specified value
of $W$ at the boundaries. The second term
of Eq. (119) corresponds exactly to the term
$F_i^c(W, x, \dot{x})$ introduced in Section 2.1. Each
component of this term can be written as

$$\Phi_{ij} = \int_{\partial C_{ij}(x)} \mathcal{F}^c(W_i, W_j, \dot{x}) \cdot \nu_i \, d\sigma = \Phi(W_i, W_j, x, \dot{x}) \qquad (120)$$

While various upwind algorithms can be used
for computing $\Phi_{ij}$, Roe’s scheme [50] is chosen
here. Following the MUSCL approach introduced
by Van Leer [51], second-order accuracy
is achieved by computing the numerical fluxes
at interpolated values of the fluid state vector
on the interface between cells $C_i$ and $C_j$ as follows

$$\Phi_{ij} = \Phi(W_{ij}, W_{ji}, x, \dot{x})$$

$$W_{ij} = W_i + \frac{1}{2} (\nabla W)_i \cdot \overrightarrow{S_i S_j} \qquad (121)$$

$$W_{ji} = W_j - \frac{1}{2} (\nabla W)_j \cdot \overrightarrow{S_i S_j}$$
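
The interpolation in Eqs. (121) can be sketched for a scalar variable as follows. This Python fragment is illustrative only: it treats a single scalar component instead of the full state vector, takes the nodal gradients as given inputs, and applies no limiter. The names `muscl_interface_states` and `field` are hypothetical.

```python
def muscl_interface_states(w_i, w_j, grad_i, grad_j, s_i, s_j):
    """Extrapolate two vertex values to the cell interface along edge SiSj."""
    edge = [b - a for a, b in zip(s_i, s_j)]            # vector SiSj
    slope_i = sum(g * e for g, e in zip(grad_i, edge))  # (grad W)_i . SiSj
    slope_j = sum(g * e for g, e in zip(grad_j, edge))  # (grad W)_j . SiSj
    w_ij = w_i + 0.5 * slope_i
    w_ji = w_j - 0.5 * slope_j
    return w_ij, w_ji


# Check on an assumed linear field: both extrapolated states should
# coincide with the exact value at the edge midpoint.
def field(x, y):
    return 1.0 + 2.0 * x + 3.0 * y

s_i, s_j = (0.0, 0.0), (1.0, 2.0)
grad = (2.0, 3.0)
w_ij, w_ji = muscl_interface_states(field(*s_i), field(*s_j),
                                    grad, grad, s_i, s_j)
w_mid = field(0.5, 1.0)
```

For a linear field the two interface states agree, so the upwind flux sees no jump; this is the mechanism by which the extrapolation raises the spatial accuracy to second order.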

For three-dimensional problems, the gradient of
$W$ at a vertex $S_i$ is computed from

(122)

In practice, the interpolation implied by Eqs. (121)
is performed on the physical instead of the
conservative variables. Optional limiters are
also implemented following the approaches discussed
in [52,53].

The numerical viscous fluxes are computed
using a classical Galerkin method.


4.2.  Temporal  solution 


The  explicit  kernel  of  our  two-dimensional  flow 
solver  uses  the  3-step  Runge-Kutta  algorithm 
discussed  in  Section  3.2.  On  the  other  hand,  the 
explicit  module  of  our  three-dimensional  flow 
solver  employs  the  predictor-corrector  scheme 
suggested  by  Hancock  and  presented  by  Van 
Leer.  This  scheme  has  a  lower  stability  limit 
than  the  3-step  Runge-Kutta  algorithm  but  is 
significantly  more  economical. 


The implicit versions of our two- and three-dimensional flow
solvers employ a first-order accurate backward Euler
time-integration scheme. To solve the system of linearized
equations arising at each time-step, we have recently developed a
multilevel, overlapping domain decomposition preconditioned
Krylov-Schwarz iterative method [54]. Numerical experiments have
shown that this and other members of the family of Krylov-Schwarz
algorithms are highly scalable and highly parallelizable. (The
concept of scalability is discussed in Section 5.) More
importantly, the convergence of these methods does not degenerate
when the linearized system becomes highly nonsymmetric and
possibly indefinite, which occurs, for example, in the case of a
high angle of attack and/or a high Mach number.


4.3.  Parallelization 

The mesh partitioning paradigm is used for parallelizing both the
two-dimensional and three-dimensional flow solvers. This paradigm
is discussed in detail in Section 8.


5.  THE  STRUCTURAL  DYNAMICS  ANALYZER  [74,76,34,81,57]


There is no question that the finite element method is the most
popular method for solving arbitrary structural problems such as
those governed by the second of Eqs. (22). However, with the
advent of parallel processing, many of the computational modules
of this powerful method are being constantly revisited for
improvement in performance.

Nonlinear transient finite element problems in structural
mechanics are characterized by the semi-discrete equations of
dynamic equilibrium

$$M \ddot{q} + f^{int}(q) = f^{ext} \qquad (123)$$

where, as before, $M$ is the mass matrix, $q$ is the vector of
nodal displacements, a dot superscript indicates a time
derivative, $f^{int}$ is the vector of internal nodal forces, and
$f^{ext}$ is the vector of external nodal forces. In many low and
medium frequency dynamics applications such as transient
aeroelasticity, Eq. (123) is most efficiently solved using an
implicit time-integration scheme. In that case, a nonlinear
algebraic system of equations is generated at each time-step. The
Newton-Raphson method and its numerous variants, collectively
known as "Newton-like" methods, are the most popular strategies
for solving these nonlinear algebraic problems. All of these
algorithms require the solution of a linear algebraic system of
equations of the form

$$K^{*} \, \Delta q_{n+1}^{(k+1)} = r^{*}(q_{n+1}^{(k)}) \qquad (124)$$

where the subscript $n$ refers to the n-th time step, the
superscript $k$ refers to the k-th nonlinear iteration within the
current time step, $K^{*}$ is a time-dependent symmetric positive
approximate tangent matrix that includes both mass and stiffness
contributions, and $\Delta q_{n+1}^{(k+1)}$ and
$r^{*}(q_{n+1}^{(k)})$ are respectively the vector of nodal
displacement increments and the vector of out-of-balance nodal
forces (dynamic residuals).


With the advent of parallel processing, domain decomposition (or
substructure) based direct and iterative algorithms have become
increasingly popular for the solution of finite element systems
of equations of the form given in Eq. (123). Indeed, domain
decomposition provides a higher level of concurrency than
parallel global algebraic paradigms, and is simpler to implement
on most parallel computational platforms [57]. In general, the
subdomain (or substructure) equations are solved using a direct
skyline or sparse factorization based algorithm, while both
direct and iterative schemes have been proposed for the solution
of the interface problem [58-62]. When the reduced system of
equations is solved directly, the overall domain decomposition
algorithm becomes a direct frontal or multifrontal method
[63,64], and its success becomes contingent on finding a good
mesh partition and/or reordered system that can achieve an
optimal balance between minimizing fill-in and increasing the
degree of parallelism [65-68]. When the interface problem is
solved iteratively, usually via a preconditioned conjugate
gradient (PCG) algorithm, the overall domain decomposition method
becomes a genuine iterative solver whose success hinges on two
important properties: numerical scalability and parallel
scalability. A domain decomposition based iterative method is
said to be numerically scalable if the condition number $\kappa$
after preconditioning does not grow or grows "weakly" with the
ratio of the subdomain size $H$ and the mesh size $h$ (Fig. 14),
that is

$$\kappa = O\left(1 + \log^{\beta}\left(\frac{H}{h}\right)\right) \qquad (125)$$

with a small constant $\beta$. Numerous authors have proved
Eq. (125) with $\beta = 2$ for various domain decomposition
methods (see, for example, [62,69,70] and references therein).


It is well known that in order to achieve (125), a domain
decomposition method must involve a coarse problem with a few
d.o.f. per subdomain that must be solved at each iteration to
propagate the error globally and accelerate convergence. Parallel
scalability characterizes the ability of an algorithm to deliver
larger speedups for a larger number of processors. In particular,
parallel scalability is necessary for massively parallel
processing.

The  practical  implications  of  a  condition 
number  after  preconditioning  such  as  that  de¬ 
scribed  in  Eq.  (125)  are 

•  Suppose that a given mesh is fixed, one processor is assigned
to every subdomain, and the number of subdomains (which varies as
1/H) is increased in order to increase parallelism. In that case,
h is fixed and H is decreased. From Eq. (125), it follows that
the bound on the condition number decreases and therefore the
number of iterations for convergence is expected to decrease with
an increasing number of subdomains. In particular, for a
numerically scalable domain decomposition algorithm characterized
by Eq. (125), increasing the number of subdomains decreases the
amount of work per processor and per iteration, without
increasing the number of iterations for convergence.

•  On most distributed memory parallel processors, the total
amount of available memory increases with the number of
processors. When solving a certain class of problems on such
parallel hardware, it is customary to define in each processor a
constant subproblem size, and to increase the total problem size
with the number of processors. In that case, h and H are
decreased, but the ratio H/h is kept constant. From Eq. (125), it
follows that a numerically scalable domain decomposition
algorithm can solve larger problems with the same number of
iterations as smaller ones, simply by increasing the number of
subdomains. However, the presence of the coarse problem may limit
parallel scalability for a large number of processors.

•  When H/h increases, that is, the number of elements assigned
to a subdomain increases, the condition number will increase only
slightly. Without this property, the condition number may be too
large to be practical for subdomains of a size that we wish to
work with. If there are only a few substructures, the conjugate
gradient algorithm might still converge quickly for some domain
decomposition methods because of the presence of gaps in the
spectrum of the preconditioned operator; however, for a large
number of subdomains, the spectrum tends to fill in, and the
number of iterations tends to increase [32].
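
The three situations listed above can be checked with a short numerical sketch of the growth factor in Eq. (125). The constant hidden in the O(.) notation is ignored, $\beta = 2$ is assumed, and the function name and sample sizes are illustrative only.

```python
import math

def kappa_bound(H, h, beta=2):
    """Growth factor 1 + log^beta(H/h) of Eq. (125); the O(.) constant is omitted."""
    return 1.0 + math.log(H / h) ** beta

# Case 1: fixed mesh (h fixed), more subdomains (H decreases).
h = 1.0 / 256
case1 = [kappa_bound(H, h) for H in (1 / 2, 1 / 4, 1 / 8, 1 / 16)]

# Case 2: fixed subproblem size (H/h constant), growing total problem.
case2 = [kappa_bound(H, H / 16) for H in (1 / 2, 1 / 4, 1 / 8, 1 / 16)]

# Case 3: larger subdomains (H/h grows), to be compared with linear growth in H/h.
case3 = [(ratio, kappa_bound(ratio, 1.0)) for ratio in (8, 16, 32, 64)]
```

In this sketch the bound decreases in the first case, stays constant in the second, and in the third grows far more slowly than the ratio H/h itself.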

The Finite Element Tearing and Interconnecting (FETI) method,
developed originally for the solution of self-adjoint elliptic
partial differential equations, is a numerically scalable domain
decomposition method [60,61,32]. This method was shown to
outperform direct skyline solvers and several popular iterative
algorithms on both sequential and parallel computing platforms
[60,73]. It has recently been extended to dynamics problems
[74,33] and biharmonic partial differential equations such as
those encountered in plate and shell problems [75]. For
structural mechanics problems, the condition number of the
unpreconditioned FETI interface problem is known to grow
asymptotically as [32]

$$\kappa = O\left(\frac{H}{h}\right) \qquad (126)$$

As was observed numerically in [32,57] and proved mathematically
in [70,75], for elasticity problems discretized using plane
stress/strain and/or brick elements, the condition number of the
FETI interface problem preconditioned with a subdomain based
Dirichlet operator [32,57] varies as

$$\kappa = O\left(1 + \log^{\beta}\left(\frac{H}{h}\right)\right), \qquad \beta \le 3 \qquad (127)$$

For  shell  and  plate  problems,  this  condition 
number  varies  as  [75] 

«  =  0{l  +  log^{^)) 

The conditioning results (126-128) highlight the numerical
scalability of the FETI method with respect to both the mesh size
h and the number of subdomains. The parallel scalability of this
domain decomposition method, that is, its ability to achieve
larger speedups for a larger number of processors, has also been
demonstrated on current massively parallel processors for several
realistic structural problems [57,71,72].

The beauty of the FETI method resides in the fact that it is much
more than an algebraic solver. Many complex structural systems
such as airplanes are constructed by assembling a set of
substructures such as the wing, fuselage, spars and ribs, and
tail, that are designed by different teams of engineers. The
global behavior of such structures is often predicted by "gluing"
together the individual substructure analyses. In such cases, the
submeshes associated with the substructures may have
non-conforming discrete interfaces, mainly because: (a) the
corresponding substructures can have different resolution
requirements, (b) the submeshes are often designed by different
analysts, and (c) these submeshes may be designed using
incompatible finite element models. Whether the substructure
interfaces are matching or not, the FETI method provides a
powerful means for solving such assembly problems [76-78]. In
essence, the FETI method is at the same time a domain
decomposition and a domain integration method, and lends itself
naturally to parallelism. We overview it next in the context of
structural dynamics.


5.1.  The  Transient  FETI  Method 


Let $\Omega$ denote the volume of the structure to be analyzed,
and $\{\Omega^s\}_{s=1}^{N_s}$ denote a partitioning (tearing) of
$\Omega$ into $N_s$ non-overlapping substructures (Fig. 15). We
denote by $\Gamma_I^s$ the interface boundary of $\Omega^s$. We
use an irreducible displacement formulation inside the
subdomains, and independently defined Lagrange multipliers on the
substructure interfaces to join them.


Fig.  15.  Partitioning  of  a  structure 


For each substructure, the finite element nonlinear equations of
dynamic equilibrium can be written as

$$M^s \ddot{q}^s + f^{int^s}(q^s) = f^{ext^s} - B^{s^T} \lambda, \qquad s = 1, \ldots, N_s \qquad (128)$$

where $B^s$ is a boolean matrix with entries equal to -1, 0, +1
that extracts from a substructure quantity those components that
are related to the interface boundary $\Gamma_I^s$, and $\lambda$
(not to be confused with its previous use for the dynamic
pressure in Section 1.1) is the vector of Lagrange multipliers
representing the traction forces needed for enforcing on the
substructure interfaces the continuity of the displacement field

$$\sum_{s=1}^{N_s} B^s q^s = 0 \qquad (129)$$


Using the notation of Eq. (124), the linearization of
Eqs. (128,129) around $q_{n+1}^{(k)}$ can be formulated as


where K* denotes here the subdomain tangent stiffness matrix.
Eqs. (130) are known as differential/algebraic equations (DAEs).
They are more difficult to solve than the usual ordinary
differential equations [79].


5.2.  Implicit  Time-Integration 


Let  ... 

n-t-|  ”+2  "+2 

denote  the  momentum  increment  at  iteration 

k  +  1  and  at  the  midpoint  between  steps  n  and 

n  -f  1,  and  let  M  =  [M^  ...  ].  We  have 


In [76], it was shown that the following time-integration
algorithm for solving the DAEs (130) is second-order accurate and
unconditionally stable


0 
where $F_I$ and $d$ are given by


1.  Solve: 


F/  =  ^  B^K*'  d  =  B^K* 


(M^  +  ^K^)Aq:‘;r 


2.  Update: 


5  =  U  •••5 


E  =  0 


sW  ,  .  ,(*=+1) 

q„+i  +  Aq„^i 

2q:+i-q: 

v;+Atv:^i 


Note that the FETI domain decomposition method transforms the
original primal problem described in Eq. (124) into a dual
interface problem. The dual interface operator $F_I$ is in
general symmetric positive semi-definite. It has interesting
spectral properties [32,57,74] which induce a superconvergent
behavior of a PCG algorithm applied to the solution of Eq. (135).
The parallelization of a conjugate gradient scheme applied to the
solution of the dual interface problem is straightforward,
because $F_I$ is a sum of independent substructure operators. All
CG related computations can be performed in parallel on a
substructure-by-substructure basis.

5.3.  The  FETI  PCG  Parallel  Algorithm 


The computational cost of the above implicit time-integration
algorithm is dominated by the cost incurred at each time step for
the solution of a constrained system of the form

$$K^{*s} q^s = g^s - B^{s^T} \lambda, \qquad s = 1, \ldots, N_s$$
$$\sum_{s=1}^{N_s} B^s q^s = 0 \qquad (133)$$

where a simplified notation has been used, and $K^{*s}$ is given
by

$$K^{*s} = M^s + \frac{\Delta t^2}{4} K^s \qquad (134)$$

After some algebraic manipulations, Eqs. (133) above can be
rewritten as

$$F_I \lambda = d \qquad (135)$$
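
The passage from the constrained system to the dual interface problem can be sketched on a toy example. The sketch below assumes the standard FETI definitions $F_I = \sum_s B^s K^{*s^{-1}} B^{s^T}$ and $d = \sum_s B^s K^{*s^{-1}} g^s$, a one-dimensional bar split into two substructures with lumped masses, and arbitrary load vectors. All names and numerical values are illustrative and are not taken from the analyzer described here.

```python
import numpy as np

def k_star(n, dt=0.1, k=1.0, m=1.0):
    """K*^s = M^s + (dt^2/4) K^s for a free-free bar with n nodes (lumped mass)."""
    K = np.zeros((n, n))
    for e in range(n - 1):
        K[e:e + 2, e:e + 2] += k * np.array([[1.0, -1.0], [-1.0, 1.0]])
    return m * np.eye(n) + 0.25 * dt**2 * K

# Two substructures of 3 nodes each, sharing one interface node.
Ks = [k_star(3), k_star(3)]
gs = [np.array([1.0, 0.0, 0.5]), np.array([0.5, 0.0, -2.0])]
Bs = [np.array([[0.0, 0.0, 1.0]]),    # +1 on the last d.o.f. of substructure 1
      np.array([[-1.0, 0.0, 0.0]])]   # -1 on the first d.o.f. of substructure 2

# Dual interface operator and right-hand side.
F_I = sum(B @ np.linalg.solve(K, B.T) for K, B in zip(Ks, Bs))
d = sum(B @ np.linalg.solve(K, g) for K, B, g in zip(Ks, Bs, gs))
lam = np.linalg.solve(F_I, d)

# Recover the substructure solutions from K*^s q^s = g^s - B^s^T lambda.
qs = [np.linalg.solve(K, g - B.T @ lam) for K, B, g in zip(Ks, Bs, gs)]

# Reference: assemble and solve the equivalent 5-d.o.f. global problem.
K_glob, g_glob = np.zeros((5, 5)), np.zeros(5)
K_glob[0:3, 0:3] += Ks[0]
g_glob[0:3] += gs[0]
K_glob[2:5, 2:5] += Ks[1]
g_glob[2:5] += gs[1]
q_glob = np.linalg.solve(K_glob, g_glob)
```

The two substructure solutions agree at the shared node and reproduce the assembled global solution. Each term of the sums over substructures involves only local data, which is what makes the dual operator easy to apply in parallel.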


We have developed two preconditioners for the FETI method: (1) a
numerically optimal Dirichlet preconditioner that can be written
as

$$\bar{F}_I^{D^{-1}} = \sum_{s=1}^{N_s} B^s \begin{bmatrix} 0 & 0 \\ 0 & K_{bb}^{*s} - K_{ib}^{*s^T} K_{ii}^{*s^{-1}} K_{ib}^{*s} \end{bmatrix} B^{s^T}$$

where the subscripts $i$ and $b$ designate here internal and
interface boundary unknowns, respectively, and (2) a numerically
efficient "lumped" preconditioner that lumps the Dirichlet
operator on the subdomain interface unknowns

$$\bar{F}_I^{L^{-1}} = \sum_{s=1}^{N_s} B^s \begin{bmatrix} 0 & 0 \\ 0 & K_{bb}^{*s} \end{bmatrix} B^{s^T}$$

Unlike $\bar{F}_I^{D^{-1}}$, the preconditioner
$\bar{F}_I^{L^{-1}}$ is not mathematically optimal. However, it
is more economical than $\bar{F}_I^{D^{-1}}$, and has often
proved to be more efficient [32,57].
Let $\bar{F}_I^{-1}$ denote either the Dirichlet or lumped
preconditioner, and let $G_I$ denote the matrix collecting the
traces on the substructure interfaces of the rigid body modes
$R^s$ of the $N_f$ floating substructures, that is, the
substructures without enough boundary conditions to prevent the
local tangent stiffness matrices $K^s$ from being singular

$$G_I = \left[ B^1 R^1 \; \ldots \; B^{N_f} R^{N_f} \right] \qquad (139)$$

Using $G_I$, we introduce the projection operator

$$P = I - G_I (G_I^T F_I G_I)^{-1} G_I^T F_I \qquad (140)$$

where $I$ denotes the identity matrix. In three-dimensional
problems, each floating substructure can have up to 6 rigid body
modes. Therefore, $G_I^T F_I G_I$ is a square matrix of size
equal at most to $6 \times N_f \le 6 \times N_s$ (6 unknowns per
floating substructure).

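The transient FETI PCG algorithm [33] for solving Eq. (135) goes
as follows

1.  Initialize

    $\lambda^0 = G_I (G_I^T F_I G_I)^{-1} G_I^T d$
    $r^0 = d - F_I \lambda^0$

2.  Iterate $k = 1, 2, \ldots$ until convergence

    Project     $w^{k-1} = P^T r^{k-1}$
    Precond.    $z^{k-1} = \bar{F}_I^{-1} w^{k-1}$
    Project     $y^{k-1} = P z^{k-1}$
                $\zeta^k = y^{k-1^T} w^{k-1} / y^{k-2^T} w^{k-2}$   ($\zeta^1 = 0$)
                $p^k = y^{k-1} + \zeta^k p^{k-1}$   ($p^1 = y^0$)
                $\nu^k = y^{k-1^T} w^{k-1} / p^{k^T} F_I p^k$
                $\lambda^k = \lambda^{k-1} + \nu^k p^k$
                $r^k = r^{k-1} - \nu^k F_I p^k$

(141)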
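
A dense-matrix sketch of a projected PCG iteration of this kind is given below. It assumes the projector $P = I - G (G^T F G)^{-1} G^T F$, the startup value $\lambda^0 = G (G^T F G)^{-1} G^T d$, and a symmetric positive definite operator. The function name, the random test matrices, and the identity preconditioner are illustrative stand-ins and not the actual FETI operators, which would be applied substructure by substructure.

```python
import numpy as np

def feti_pcg(F, d, G, apply_prec, tol=1e-10, max_iter=200):
    """Projected PCG for F lam = d with coarse space range(G) (dense sketch)."""
    coarse = G.T @ F @ G                       # coarse operator G^T F G

    def P(v):                                  # P = I - G (G^T F G)^-1 G^T F
        return v - G @ np.linalg.solve(coarse, G.T @ (F @ v))

    def PT(v):                                 # P^T = I - F G (G^T F G)^-1 G^T
        return v - F @ (G @ np.linalg.solve(coarse, G.T @ v))

    lam = G @ np.linalg.solve(coarse, G.T @ d)  # startup value in range(G)
    r = d - F @ lam
    p, yw_old = None, None
    for k in range(1, max_iter + 1):
        w = PT(r)                              # first projection
        if np.linalg.norm(w) <= tol * np.linalg.norm(d):
            return lam, k - 1
        z = apply_prec(w)                      # preconditioning
        y = P(z)                               # second projection
        yw = y @ w
        p = y if p is None else y + (yw / yw_old) * p
        Fp = F @ p
        nu = yw / (p @ Fp)
        lam = lam + nu * p
        r = r - nu * Fp
        yw_old = yw
    return lam, max_iter

rng = np.random.default_rng(0)
n, m = 30, 3
B = rng.standard_normal((n, n))
F = B @ B.T + n * np.eye(n)          # symmetric positive definite stand-in for F_I
G = rng.standard_normal((n, m))      # stand-in for G_I
d = rng.standard_normal(n)
lam, iters = feti_pcg(F, d, G, apply_prec=lambda w: w)   # identity preconditioner
```

Each iteration applies the projector twice, so two small systems with the coarse matrix are solved per iteration. In a production code this matrix would be factored once or handled as described in Section 5.4 rather than re-solved from scratch.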


The application of the projection operator $P$ defined in (140)
means that a coarse problem of the form $(G_I^T F_I G_I) x = b$
must be solved twice within each FETI iteration. It was shown in
[32] that this coarse problem has the expected beneficial effect
of coupling all substructure computations and propagating the
error globally, so that the condition number of the
preconditioned interface problem can be bounded as a function of
H/h and independently of the number of substructures, which
ensures the numerical scalability of the FETI method.

For shell and plate problems, the definition of $R^s$ is slightly
modified to include not only the substructure rigid body modes,
but also the substructure "corner" modes [75]. Otherwise, the
remainder of the FETI algorithm remains essentially the same.

5.4.  Optimization  for  Problems
with  Multiple/Repeated  R.H.S.

One of the many reasons why numerical scalability is desirable is
that increasing the number of subdomains is the simplest means
for increasing the degree of parallelism of a domain
decomposition based PCG algorithm. As illustrated in the previous
paragraph, this optimal property is usually achieved via the
introduction in a domain decomposition method of a coarse problem
(or coarse grid, by analogy with multigrid methods) that relates
to the original problem and that must be solved at each global CG
iteration. Direct methods are often chosen for solving the coarse
problem despite the fact that they are difficult to implement on
a massively parallel processor and do not parallelize as well as
iterative schemes. Therefore in many cases, a numerically
scalable domain decomposition method loses its appeal because of
its lack of parallel scalability. One way to restore parallel
scalability is to solve the coarse problem iteratively, for
example using a CG scheme. However, because the coarse problem is
embedded in an outer iterative loop, this approach raises the
question of how to solve iteratively and efficiently a system
with a constant matrix and repeated right-hand sides. Finding an
answer to this question also extends the range of applications of
domain decomposition based iterative methods to design problems,
eigenvalue problems, and several other applications where
multiple and repeated right-hand sides always arise and challenge
iterative solvers. Such examples include nonlinear transient
aeroelastic simulations where the structure is assumed to remain
in the linear regime. In that case, the left hand side $F_I$ of
Eq. (135) remains constant in time, but its right hand side
(r.h.s.) $d$ varies in time.

The iterative solution of systems with multiple and/or repeated
right-hand sides has been previously addressed in [80], and
recently in [34,81]. Here, we overview the CG based methodology
for solving such problems that uses the same data structures as
those employed in domain decomposition methods without a coarse
grid and which was first presented in [34,81]. The basic idea is
related to that analyzed in [80]. However, the algorithm we have
developed is different, simpler, and easier to parallelize than
that described in [80].

Since $G_I^T F_I$ appears twice in the expression of the
projector $P$ (140), we first construct $Q = F_I G_I$. Suppose
that the solution of the first encountered coarse problem
$(G_I^T Q) x^1 = b^1$ has been obtained after a number of CG
iterations. Let $S^1$ denote the space of the
$(G_I^T Q)$-orthogonal search directions generated during these
CG iterations. If explicit re-orthogonalization is implemented in
the CG algorithm [57], $S^{1^T} (G_I^T Q) S^1$ is guaranteed to
be a diagonal matrix. In order to compute the solution of the
next coarse problem $(G_I^T Q) x^2 = b^2$, we proceed as follows


Step 1. we project the problem $(G_I^T Q) x^2 = b^2$ onto $S^1$
and solve the trivial diagonal system
$S^{1^T} (G_I^T Q) S^1 y^0 = S^{1^T} b^2$. Then, we perform a
matrix-vector multiplication to obtain $x_0^2 = S^1 y^0$. In
[34], it is argued that $x_0^2$ is an optimal startup value for
$x^2$ because: (a) it minimizes $x^T (G_I^T Q) x / 2 - x^T b^2$
over $S^1$, and (b) it is inexpensive to compute. Note that the
non-zero entries of the diagonal matrix $S^{1^T} (G_I^T Q) S^1$
are readily available from the CG solution of the previous coarse
problem $(G_I^T Q) x^1 = b^1$. Therefore, these entries can be
stored and need not be re-computed.

Step 2. next, we apply the CG algorithm to the solution of
$(G_I^T Q) x^2 = b^2$ after it has been modified to: (a) accept
$x_0^2$ as a startup solution, and (b) perform the explicit
orthogonalization of the new search directions against $S^1$ with
respect to $(G_I^T Q)$.

The solution of all subsequent coarse problems is carried out
using the same two-step procedure outlined above. Essentially,
the space of previous search directions is constantly enriched
with the most recently computed ones, and orthogonalization with
respect to $(G_I^T Q)$ is always performed. The storage
requirements associated with this methodology are minimal because
it is applied to coarse and therefore small problems (see [34]
for further details). Because full precision is required for the
solution of all coarse problems, the solution of the first one
typically converges in a number of iterations equal to the size
of the matrix $(G_I^T Q)$, that is, the total number of
substructure rigid body modes, and all subsequent coarse problems
are solved in zero iterations, using only the optimal startup
value.
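
The two-step procedure can be sketched for a generic symmetric positive definite matrix A standing in for the coarse operator. The class below stores every search direction, builds the startup value by projection onto the stored directions, and explicitly A-orthogonalizes each new direction against all stored ones. The class name, the tolerance, and the random test matrix are assumptions made for illustration and are not part of the implementation described in [34,81].

```python
import numpy as np

class RecyclingCG:
    """CG with stored A-orthogonal search directions for repeated right-hand sides."""

    def __init__(self, A, tol=1e-10):
        self.A, self.tol = A, tol
        self.S, self.AS, self.sAs = [], [], []   # directions s, A s, and s^T A s

    def solve(self, b):
        n = len(b)
        # Step 1: startup value from the (diagonal) projected system.
        x = np.zeros(n)
        for s, sAs in zip(self.S, self.sAs):
            x += (s @ b) / sAs * s
        r = b - self.A @ x
        iters = 0
        # Step 2: CG with explicit A-orthogonalization against stored directions.
        while np.linalg.norm(r) > self.tol * np.linalg.norm(b) and len(self.S) < n:
            p = r.copy()
            for s, As, sAs in zip(self.S, self.AS, self.sAs):
                p -= (As @ p) / sAs * s
            Ap = self.A @ p
            pAp = p @ Ap
            alpha = (p @ r) / pAp
            x += alpha * p
            r -= alpha * Ap
            self.S.append(p)
            self.AS.append(Ap)
            self.sAs.append(pAp)
            iters += 1
        return x, iters

rng = np.random.default_rng(1)
n = 8
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # small SPD stand-in for the coarse matrix
solver = RecyclingCG(A)
b1, b2 = rng.standard_normal(n), rng.standard_normal(n)
x1, it1 = solver.solve(b1)
x2, it2 = solver.solve(b2)
```

Once the stored directions span the whole space, which for a small matrix happens during the first solve, later right-hand sides are resolved by the startup projection alone. The sketch keeps every direction in memory, which is affordable only because the system is small.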

Clearly, the methodology outlined above for solving iteratively
and efficiently a system of equations with a constant matrix and
repeated right-hand sides is equally applicable to any
(symmetric) system of the form Ax = b where A is of a relatively
small size. In particular, it is applicable to Eq. (135) since
the size of $F_I$ is equal to the number of interface d.o.f., and
that number is usually less than 30% of the total number of
structural d.o.f. Therefore, this methodology can be used for
solving nonlinear transient aeroelasticity problems where the
structure remains in the linear regime. In that case, $F_I$ is
the same at all time-steps, and $d$ varies in time with the
pressure associated with the unsteady flow.

As an example, we apply the methodology described above to the
solution of the repeated systems arising from the linear
transient analysis, using an implicit time-integration scheme, of
the three-dimensional stiffened wing of a High Speed Civil
Transport (HSCT) aircraft (Fig. 16). The structure is modeled
with 6,204 triangular shell elements and 456 beam elements, and
includes 18,900 d.o.f. The finite element mesh is partitioned
into 32 subdomains with excellent aspect ratios using
TOP/DOMDEC [82]. The size of the interface problem is 3,888, that
is, 20.57% of the size of the global problem. The transient
analysis is carried out on a 32-processor iPSC-860 system. After
all of the usual finite element storage requirements are
allocated, there is enough memory left to store a total number of
360 search directions. This number corresponds to 9.25% of the
size of the interface problem.


Fig.  16.  HSCT  stiffened  wing 


Using the transient FETI method, the system of equations arising
at the first time step is solved in 30 iterations and 7.75
seconds CPU. After 5 time steps, 89 search directions are
accumulated and only 10 iterations are needed for solving the
fifth linear system of equations (Fig. 17). After 45 time steps,
the total number of accumulated search directions is only 302,
that is, only 7.76% of the size of the interface problem, and
superconvergence is triggered; all subsequent time steps are
solved in 2 or 3 iterations (Fig. 17) and in less than 0.78
second CPU (Fig. 18).

When a parallel skyline direct solver is applied to the above
problem, the factorization phase consumes 60.5 seconds CPU, and
at each time step the pair of forward/backward substitutions
requires 10.65 seconds on the same 32-processor iPSC-860.
Therefore, the solution methodology described herein is clearly
an excellent alternative to repeated forward/backward
substitutions on distributed memory parallel processors.


Fig.  17.  Implicit transient aeroelasticity via FETI: number of iterations

Fig.  18.  Implicit transient aeroelasticity via FETI: seconds CPU


6.  NON-MATCHING  INTERFACE 
BOUNDARIES  [35] 

All four partitioned analysis procedures discussed in Section 3
require exchanging interface-data only between the field
analyzers. More precisely, the structure expects to receive the
values of the flow pressure and viscous stresses at the
fluid/structure interface boundary $\Gamma_{F/S}$ and convert
them into a structural load. Similarly, the fluid expects to
receive from the structure the displacement and/or velocity of
the interface boundary $\Gamma_{F/S}$ and use them to update the
position of the dynamic fluid mesh. This exchange is performed at
every time-step, or as required by the subcycling algorithm.

In general, the fluid and structure meshes have two independent
representations of the physical fluid/structure interface. When
these representations are identical (for example, when every
fluid grid point on $\Gamma_{F/S}$ is also a structural node and
vice-versa), the evaluation of the pressure forces and the
transfer of the structural motion to the fluid mesh are trivial
operations. However, analysts usually prefer to

•  use  a  fluid  mesh  and  a  structural  model 
that  have  been  independently  designed 
and  validated. 

•  refine  each  mesh  independently  from  the 
other. 

Hence, most realistic aeroelastic simulations will involve
handling fluid and structural meshes that are incompatible at
their interface boundaries (Fig. 19). In [35], we have addressed
this issue and proposed a preprocessing "matching" procedure that
does not introduce any other approximation than those intrinsic
to the fluid and structure solution methods. This procedure can
be summarized as follows.

Fig.  19.  Incompatible  fluid  and  structure  meshes


The  nodal  forces  induced  by  the  fluid  pres¬ 
sure  on  the  “wet”  surface  of  a  structural  ele¬ 
ment  e  can  be  written  as: 

=  f  Ni(-pu  +  {Tu)i>)dcr  (142) 

where  denotes  the  geometrical  support  of 
the  wet  surface  of  the  structural  element  e,  p 
is  the  pressure  field,  r  is  the  tensor  of  viscous 
stresses,  u  is  the  unit  normal  to  is  a 

tangent  to  the  plane  of  and  Ni  is  the  shape 
function  associated  with  node  i  in  element  e. 
Most  if  not  all  finite  element  structural  analysis 
codes  evaluate  the  integral  in  Eq.  (142)  via  a 
quadrature  rule 

g=ng 

/r'  =  Y.  ^,mX,){-p(X.)l,  +  ir(XMP) 

5  =  1 

(143) 

where $W_g$ is the weight of the Gauss point $X_g$. Hence, a
structural analysis code needs to know only the values of the
pressure field at the Gauss points of its wet surface. This
information can be easily made available once every Gauss point
of a wet structural element is paired with a fluid cell
(Fig. 20). It should be noted that in Eq. (143), the $X_g$ are
not necessarily the same Gauss points as those used for stiffness
evaluation. For example, if a high pressure gradient is
anticipated over a certain wet area of the structure, a larger
number of Gauss points can be used for the evaluation of the
pressure forces on that area.
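
A minimal sketch of such a quadrature is given below for the pressure contribution only, on a two-node wet edge with linear shape functions. The viscous stress term is omitted, the pressure is supplied by a callable that stands in for the value read from the fluid cell paired with each Gauss point, and all names are illustrative.

```python
import numpy as np

def pressure_nodal_forces(x_a, x_b, pressure, normal, n_gauss=2):
    """Nodal forces on a 2-node wet edge: sum over g of W_g N_i(X_g) (-p(X_g) nu)."""
    x_a, x_b = np.asarray(x_a, float), np.asarray(x_b, float)
    normal = np.asarray(normal, float)
    xi, w = np.polynomial.legendre.leggauss(n_gauss)   # points/weights on [-1, 1]
    jac = 0.5 * np.linalg.norm(x_b - x_a)              # edge length / 2
    forces = np.zeros((2, len(normal)))
    for xi_g, w_g in zip(xi, w):
        N = np.array([0.5 * (1.0 - xi_g), 0.5 * (1.0 + xi_g)])  # linear shape functions
        X_g = N[0] * x_a + N[1] * x_b                           # Gauss point location
        traction = -pressure(X_g) * normal                      # pressure acts along -nu
        forces += w_g * jac * np.outer(N, traction)
    return forces

# Uniform pressure p0 on an edge of length 2 with unit normal (0, 1).
p0 = 3.0
f = pressure_nodal_forces([0.0, 0.0], [2.0, 0.0], lambda X: p0, [0.0, 1.0])
```

For a uniform pressure the total load is split equally between the two nodes. Raising `n_gauss` corresponds to the option mentioned above of using more Gauss points where a high pressure gradient is anticipated.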

On the other hand, when the structure moves and/or deforms, the
motion of the fluid grid points on $\Gamma_{F/S}$ can be
prescribed via the regular finite element interpolation

$$x(S_j) = \sum_{k=1}^{k=wne} N_k(\chi_j) \, q_k \qquad (144)$$

where $S_j$, $wne$, $\chi_j$, and $q_k$ denote respectively a
fluid grid point on $\Gamma_{F/S}$, the number of wet nodes in
its "nearest" structural element $e$, the natural coordinates of
$S_j$ in element $e$, and the structural displacement at the k-th
node of element $e$. From Eq. (144), it follows that each fluid
grid point on $\Gamma_{F/S}$ must be matched with one wet
structural element (Fig. 21).
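
The interpolation of the structural motion to a fluid grid point can be sketched as follows, assuming a three-node triangular structural element with linear shape functions and translational d.o.f. only. Shell elements with rotational d.o.f. would require more than this; the function names and displacement values are illustrative.

```python
import numpy as np

def tri_shape_functions(xi, eta):
    """Linear shape functions of a 3-node triangle at natural coordinates (xi, eta)."""
    return np.array([1.0 - xi - eta, xi, eta])

def fluid_point_displacement(chi_j, q_nodes):
    """Sum over the wet nodes k of N_k(chi_j) q_k for the paired element."""
    N = tri_shape_functions(*chi_j)
    return N @ np.asarray(q_nodes, float)

# Displacements of the three wet nodes of the paired structural element (rows = nodes).
q_nodes = np.array([[0.0, 0.0, 0.3],
                    [0.0, 0.0, 0.6],
                    [0.0, 0.0, 0.0]])
x_centroid = fluid_point_displacement((1.0 / 3.0, 1.0 / 3.0), q_nodes)
x_at_node2 = fluid_point_displacement((1.0, 0.0), q_nodes)
```

The only data needed per fluid grid point are the index of its paired element and its natural coordinates in that element, both of which can be computed once in a preprocessing step.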


Fig.  20.  Gauss point / fluid cell pairing

Fig.  21.  Fluid grid point / wet structural element pairing


Given a fluid grid, a structural analysis model, and a discrete
description of the fluid/structure interface, the Matcher program
described in [35] generates all the data structures needed for
evaluating the quantities described in Eqs. (143,144). If
parallel data structures (for example, data structures associated
with mesh partitions of the fluid and structure grids, see
Section 8) are fed as input, Matcher outputs parallel data
structures that allow a painless implementation of the
interface-data exchange between the field analyzers and are fully
compatible with the intrinsic parallel data structures of the
fluid and structural analysis programs. In general, the pairing
of fluid and structure entities does not change in time.
Therefore, Matcher is run as a preprocessor. The pairing data
structures are generated only once, prior to any aeroelastic
computation.

Finally, we note that Matcher is written in a message-passing
style. Therefore, this software is portable to any parallel
computing platform that supports a PVM- or MPI-like communication
library. Of course, it also runs on sequential computers. For a
complete aircraft configuration where the fluid mesh contained
439272 tetrahedra and 77279 vertices and was partitioned into 32
subdomains, and the structural model contained 7520 triangular
shell elements and 3841 nodes and was partitioned into 16
subdomains, Matcher generated the desired fluid/structure pairing
data structures in 327 seconds CPU on a 32-processor iPSC-860
system [35].

7. THE MESH MOTION SOLVER [44]

At  the  beginning  of  each  step  of  the  chosen  stag¬ 
gered  solution  procedure,  the  dynamic  fluid  grid 
must  be  updated  to  conform  to  the  most  re¬ 
cently  computed  configuration  of  the  structure. 
In general, this is done in two steps:

Step 1. First, the points lying on the interface boundary
$\Gamma_{F/S}$ are adjusted to match (in the sense defined in
Section 6) the newly computed or predicted position of the surface of
the structure. This defines a prescribed displacement vector
$x_{\Gamma_{F/S}}$.

Step 2. Next, the remainder of the fluid grid points are repositioned
according to the prescribed values of $x_{\Gamma_{F/S}}$. This
completes the computation of the new mesh displacement vector x.

Several  procedures  have  been  proposed  in 
the  literature  for  implementing  the  above  two 
steps.  Most  of  them  can  be  summarized  as 
viewing  the  fluid  domain  or  its  grid  as  a  pseudo- 
structural  continuous  or  discrete  system.  For 
example,  in  the  discrete  approach,  either  or  all 
of  the  following  can  be  done  (see  Fig.  3) 

•  lumping  a  fictitious  mass  at  each  vertex  of 
the  fluid  mesh. 

•  introducing  a  fictitious  dashpot  at  each 
edge  connecting  two  vertices. 

•  attaching  a  fictitious  spring  on  each  edge 
connecting  two  vertices. 

Similarly,  a  pseudo-structural  continuous  sys¬ 
tem  can  be  generated  with  fictitious  distributed 
structural  properties.  In  both  cases,  the  motion 
of  the  constructed  pseudo-structural  system  is 
governed  by 

$$M\ddot{x} + D\dot{x} + Kx = K_c\, x_{\Gamma_{F/S}} \qquad (145)$$


where M, D, and K are the fictitious mass, damping, and stiffness
matrices associated with the dynamic fluid mesh, and $K_c$ is the
component of the fictitious stiffness matrix that represents the
coupling between the fluid points lying on $\Gamma_{F/S}$ and the
others. Eq. (145) above is integrated in time until the steady-state
equilibrium displacement x is reached. This solution
procedure can be sped up by constructing M and D as follows

$$M = aK, \qquad D = bK \qquad (146)$$

and selecting the two scalars a and b so that the governing equations
of motion (145) are critically damped. In that case, the equilibrium
solution x is reached in a few time-steps.

Alternatively, M and D can be set to zero, and the new position of the
dynamic fluid mesh can be computed via the solution of the static
problem

$$Kx = K_c\, x_{\Gamma_{F/S}} \qquad (147)$$

This approach is often referred to as Batina's network of edge-springs
[15]. However, it should be noted that attaching a lineal spring on the
edge connecting two vertices of a tetrahedron prevents these two
vertices from colliding during the mesh deformation, but does not
prevent a vertex from interpenetrating the facet of a tetrahedron. To
prevent such a detrimental interpenetration, which is more likely to
happen when the structure undergoes large motions, torsional springs
must also be added at the mesh vertices, and their stiffnesses must be
carefully calibrated.

In this work, the pseudo-structural system associated with the
unstructured dynamic fluid mesh is constructed with lineal and
torsional springs only (M = D = 0). Each fictitious lineal spring
attached to an edge connecting two fluid grid points $S_i$ and $S_j$ is
attributed the following stiffness

$$k_{ij} = \frac{1}{\|x_j - x_i\|_2} \qquad (148)$$

The grid points located on the upstream and downstream boundaries are
held fixed. At each time-step $t^{n+1}$, the new position of the
interior grid points is determined from the solution of Eq. (147) via a
two-step iterative procedure. First, the displacements of the interior
grid points are predicted by extrapolating the previous displacements
at time-steps $t^{n}$ and $t^{n-1}$ in the following manner

$$\tilde{\Delta}^{n+1} x = 2\Delta^{n} x - \Delta^{n-1} x \qquad (149)$$

where $\Delta^{n} x = x^{n} - x^{n-1}$. Next, the predicted values are
corrected with a few explicit Jacobi relaxations as follows

$$\Delta^{n+1} x_i = \frac{\sum_j k_{ij}\, \tilde{\Delta}^{n+1} x_j}{\sum_j k_{ij}} \qquad (150)$$

Finally, the position of the fluid grid points at $t^{n+1}$ is computed
from

$$x^{n+1} = x^{n} + \Delta^{n+1} x \qquad (151)$$
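
The predictor/corrector update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `spring_mesh_update` and its arguments are hypothetical, the lineal spring stiffness is assumed to be the inverse of the edge length, torsional springs are left out, and the prescribed increments on the interface and on the fixed boundaries are passed in as a dictionary.

```python
import math

def spring_mesh_update(x, edges, dx_n, dx_nm1, prescribed, n_sweeps=5):
    """One mesh update with lineal edge springs only (illustrative sketch).

    x          : list of node positions at t^n, each a list of d floats
    edges      : list of (i, j) node index pairs
    dx_n       : increments Delta^n x for every node
    dx_nm1     : increments Delta^{n-1} x for every node
    prescribed : dict node -> prescribed increment (interface and fixed nodes)
    """
    n, d = len(x), len(x[0])
    # Assumption: stiffness inversely proportional to the edge length.
    k = [1.0 / math.dist(x[i], x[j]) for i, j in edges]
    # Predictor: linear extrapolation of the two previous increments.
    dx = [[2.0 * dx_n[p][c] - dx_nm1[p][c] for c in range(d)] for p in range(n)]
    for p, value in prescribed.items():
        dx[p] = list(value)
    # Corrector: explicit Jacobi relaxations on the free nodes only.
    for _ in range(n_sweeps):
        num = [[0.0] * d for _ in range(n)]
        den = [0.0] * n
        for (i, j), kij in zip(edges, k):
            for c in range(d):
                num[i][c] += kij * dx[j][c]
                num[j][c] += kij * dx[i][c]
            den[i] += kij
            den[j] += kij
        dx = [dx[p] if p in prescribed
              else [num[p][c] / den[p] for c in range(d)]
              for p in range(n)]
    # New positions: previous positions plus the corrected increments.
    x_new = [[x[p][c] + dx[p][c] for c in range(d)] for p in range(n)]
    return x_new, dx
```

For a one-dimensional chain of three nodes with the left end fixed and the right end displaced by 0.2, the middle node ends up displaced by 0.1, as expected for two equal springs. Every node is assumed to belong to at least one edge.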

8.  A  UNIFIED  PARALLELIZATION 


In  addition  to  numerical  efficiency  and  paral¬ 
lel  scalability,  portability  should  be  a  major 
concern,  especially  for  production  codes.  With 
the  proliferation  of  computer  architectures,  it  is 
essential  to  adopt  a  programming  model  that 
does  not  require  rewriting  thousands  of  lines 
of  code  —  or  even  worse,  altering  the  archi¬ 
tectural  foundations  of  a  code  —  every  time  a 
new  parallel  processor  is  acquired.  Here,  we 
are  neither  referring  to  differences  between  pro¬ 
gramming  languages,  nor  to  differences  between 
the multitude of parallel extensions to a specific
programming language. We are more concerned
about  the  impact  of  a  given  parallel  hardware 
architecture  on  the  finite  element  software  de¬ 
sign,  and  sometimes,  on  the  solution  algorithm 
itself.  For  example,  a  data  parallel  code  written 
for  the  CM-2  or  CM-5  machines  would  require 
major overhauling before it can be adapted to an
iPSC  computer.  A  parallel-do-loop  based  code 
can  be  easily  ported  across  different  true  shared 
memory  multiprocessors,  but  may  require  sub¬ 
stantial  modifications  before  it  can  run  success¬ 
fully  on  some  distributed  memory  systems. 

Based  on  our  “hands  on”  experience  with 
a dozen different parallel processors, we can
argue  that  the  mesh  partitioning  and  message¬ 
passing  paradigms  lead  to  the  most  portable 
software  design  for  parallel  computational  me¬ 
chanics.  Essentially,  the  underlying  mesh  is  as¬ 
sumed  to  be  partitioned  into  several  submeshes, 
each  defining  a  subdomain.  The  same  “old” 
serial  code  can  be  executed  within  every  sub- 
domain.  The  “gluing”  or  assembly  of  the  sub- 
domain  results  can  be  implemented  in  a  sepa¬ 
rate  software  module.  Clearly,  this  is  an  object- 
oriented  approach  that  is  best  programmed  in 
C++, but which can also be programmed in
FORTRAN  or  any  other  language.  This  ap¬ 
proach  enforces  data  locality  and  therefore  is 
suitable  for  all  parallel  hardware  architectures. 
Note  that  in  this  context,  “message-passing” 
refers  to  the  assembly  phase  of  the  subdomain 
results.  However,  it  does  not  imply  that  mes¬ 
sages  have  to  be  explicitly  exchanged  between 
the  subdomains.  For  example,  message-passing 
can  be  implemented  on  a  shared  memory  multi¬ 
processor  as  a  simple  access  to  a  shared  buffer, 
or  as  a  duplication  of  one  buffer  into  another 
one.  Moreover,  the  message-passing  program¬ 
ming  model  produces  software  modules  that  are 
easy  to  maintain,  because  except  for  the  gluing 
procedures,  the  subdomain  code  can  be  made 
identical  to  that  of  a  workstation  version. 

In many cases, expensive parallel processors
are affordable because some simulations
can substitute for experimental studies that
would  take  much  longer  and  cost  much  more  to 
carry  out.  However,  there  are  also  other  cases 
where  current  parallel  processors  are  simply  too 
expensive,  so  that  a  network  of  relatively  inex¬ 
pensive  workstations  is  preferred.  Obviously,  a 
message-passing  based  software  can  be  quickly 
adapted  to  a  cluster  of  workstations,  for  exam¬ 
ple,  using  a  PVM-like  communication  tool. 

Therefore,  all  of  our  flow  solvers,  struc¬ 
tural  analyzers,  and  mesh  motion  solvers  are 
designed  to  work  with  mesh  partitions,  and 
are  written  in  a  message-passing  style.  Conse¬ 
quently,  their  performance  is  not  only  machine 
dependent,  but  sometimes  also  mesh  partition 
dependent. 

Research  in  mesh  partitioning  has  focused 
so  far  on  the  automatic  generation  of  sub- 
domains  with  minimum  interface  points.  In 
this  section,  we  address  this  issue  and  em¬ 
phasize  other  aspects  of  the  partitioning  prob¬ 
lem  including  the  fast  generation  of  large-scale 
mesh  decompositions  on  conventional  worksta¬ 
tions,  the  optimization  of  initial  decomposi¬ 
tions  for  specific  kernels  such  as  parallel  frontal 
solvers  and  domain  decomposition  based  iter¬ 
ative  methods.  More  specifically,  we  overview 
a  two-step  partitioning  paradigm  for  tailoring 
generated  mesh  partitions  to  specific  applica¬ 
tions,  and  a  simple  mesh  contraction  proce¬ 
dure  for  speeding  up  the  optimization  of  ini¬ 
tial  mesh  decompositions.  We  discuss  what  de¬ 
fines  a  good  mesh  partition  for  a  given  problem, 
and  show  that  the  methodology  summarized 
here  can  produce  better  mesh  partitions  than 
the  well  celebrated  multilevel  Recursive  Spec¬ 
tral  Bisection  algorithm,  and  yet  be  an  order  of 
magnitude  faster.  We  illustrate  the  combined 
two-step  partitioning  and  contraction  method¬ 
ology  with  examples  from  structural  mechanics 
and  fluid  dynamics  problems,  and  highlight  its 
impact on the total solution time of realistic applications on current
massively parallel processors. In particular, we show that the minimum
interface size criterion does not have a significant impact on a
reasonably well parallelized application, and highlight other criteria
that can have a significant impact.

8.2. The Greedy and RSB Algorithms: Two Extremes

It  is  often  argued  and  demonstrated  that  if 
unstructured  computational  mechanics  prob¬ 
lems  are  to  be  efficiently  solved  on  distributed- 
memory  parallel  computers,  their  data  struc¬ 
tures  must  be  partitioned  and  distributed  across 
the  processors  in  a  way  that  maximizes  load  bal¬ 
ance  and  minimizes  interprocessor  communica¬ 
tion  [46,83].  However,  research  in  mesh  parti¬ 
tioning  algorithms  has  mostly  focused  on  the 
second  issue  —  that  is,  on  minimizing  interpro¬ 
cessor  communication  costs  only,  and  the  num¬ 
ber  of  interface  points  in  a  mesh  partition,  or 
the  number  of  edge  cuts  in  its  corresponding 
graph,  has  rapidly  become  the  main  “accep¬ 
tance  test”  for  a  proposed  mesh  decomposer. 

While  several  mesh  partitioning  algorithms 
have  already  been  presented  and/or  discussed 
in the literature [83-88], two radically different
schemes  have  particularly  attracted  the  atten¬ 
tion  of  the  user  and  developer  communities:  the 
Greedy  algorithm  [57,84,85],  and  the  Recursive 
Spectral  Bisection  algorithm  [83,88,93]. 

The  Greedy  (GR)  mesh  partitioning  algo¬ 
rithm  was  first  proposed  in  [89]  and  applied  to 
the  parallel  solution  of  finite  element  structural 
equations  on  the  iPSC-1  system.  This  mesh  de¬ 
composition  scheme  is  referred  to  as  the  Greedy 
algorithm  because  it  essentially  “bites”  into  the 
mesh  in  order  to  construct  the  subdomains.  It 
exploits  only  the  mesh  connectivity  informa¬ 
tion,  which  makes  it  the  fastest  partitioning  al¬ 
gorithm  we  know  about.  In  general,  the  GR  al¬ 
gorithm  tends  to  generate  mesh  partitions  that 
are  characterized  by  reasonable  subdomain  as¬ 
pect  ratios  and  a  relatively  small  number  of 
interface points. On a few occasions, this algorithm has been
misrepresented [90], perhaps because one statement is unfortunately
missing in the Fortran code given in [84]. This statement is the one
which forces every subdomain to
start  with  an  element  attached  to  the  previously 
computed  interface.  The  GR  algorithm  enjoys 
a  relatively  large  user  community  because  of  its 
high  performance/price  ratio.  For  example,  it 
is  capable  of  partitioning  a  three-dimensional 
unstructured  mesh  containing  439272  tetrahe- 
dra  and  77279  vertices  into  64  subdomains  with 
25906  interface  points,  in  less  than  15  seconds 
on  a  Crimson  Silicon  Graphics  workstation.  Re¬ 
cently,  some  interesting  variants  of  the  basic  GR 
algorithm  have  been  proposed  [91,92]. 
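
A minimal sketch of a greedy decomposer in this spirit is given below. It is not the Fortran code of [84]: the function name `greedy_partition`, the element adjacency input, and the breadth-first growth rule are assumptions made for illustration. The sketch uses connectivity only, and each new subdomain starts from an element left on the front of the previous one, that is, an element attached to the interface computed so far.

```python
from collections import deque

def greedy_partition(adjacency, n_sub):
    """Grow n_sub subdomains by 'biting' into the element adjacency graph.

    adjacency : list where adjacency[e] lists the elements adjacent to e
    returns   : list part with part[e] = subdomain index of element e
    """
    n_elem = len(adjacency)
    part = [-1] * n_elem
    assigned = 0
    leftover = deque()  # unassigned elements bordering the last subdomain
    for k in range(n_sub):
        # Spread the remaining elements evenly over the remaining subdomains.
        target = (n_elem - assigned) // (n_sub - k)
        # Start from an element attached to the previously built interface.
        start = next((e for e in leftover if part[e] == -1), None)
        queue = deque() if start is None else deque([start])
        size = 0
        while size < target:
            if not queue:
                # Front exhausted: reseed with any unassigned element.
                free = next((e for e in range(n_elem) if part[e] == -1), None)
                if free is None:
                    break
                queue.append(free)
            e = queue.popleft()
            if part[e] != -1:
                continue
            part[e] = k
            size += 1
            assigned += 1
            queue.extend(nb for nb in adjacency[e] if part[nb] == -1)
        leftover = queue
    return part
```

On a chain of eight elements split into four subdomains, the sketch returns consecutive pairs, each subdomain starting where the previous one stopped.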

The  Recursive  Spectral  Bisection  (RSB) 
graph  partitioning  algorithm  was  first  proposed 
in  [88].  This  scheme  is  at  the  same  time  the 
least  intuitive  mesh  decomposer,  and  the  parti¬ 
tioning  algorithm  that  has  most  attracted  the 
attention  of  the  parallel  computing  community. 
Unlike  the  Greedy  algorithm  which  is  simple 
and  has  no  underlying  theory,  the  RSB  scheme 
is  sophisticated  and  relies  on  a  relatively  well 
understood  graph  theory.  More  specifically,  the 
RSB  algorithm  is  derived  from  a  graph  bisec¬ 
tion  strategy  based  on  the  computation  of  the 
Fiedler  vector  —  that  is,  the  second  eigenvec¬ 
tor  of  the  Laplacian  matrix  of  the  graph  asso¬ 
ciated  with  the  given  problem  [88].  Thanks  to 
the  multilevel  strategy  described  in  [93]  for  ex¬ 
tracting  the  Fiedler  vector,  the  computational 
requirements  of  this  partitioning  scheme  are  no 
longer  overwhelming,  even  on  a  simple  work¬ 
station.  However,  the  multilevel  RSB  algo¬ 
rithm  is  still  more  expensive  than  most  other 
partitioning  schemes.  For  example,  when  ap¬ 
plied  to  the  decomposition  into  64  subdomains 
of  the  same  three-dimensional  mesh  containing 
439272  tetrahedra  and  77279  vertices,  it  con¬ 
sumes  707  seconds  on  a  Crimson  Silicon  Graph¬ 
ics  workstation  and  generates  a  mesh  partition 
with 21139 interface points. This mesh partition has 18.40% fewer
interface points than the decomposition generated by the Greedy
algorithm, but costs 48.07 times more CPU time to generate. Depending
on the target parallel application, such an improvement at such a price
may  or  may  not  be  interesting.  Recently,  a  par¬ 
allel  version  of  the  RSB  algorithm  has  been  im¬ 
plemented  on  the  CM-5  [94].  This  version  has 
certainly  improved  the  computational  feasibil¬ 
ity  of  the  RSB  partitioner. 
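
The bisection step underlying RSB can be illustrated as follows. This is a sketch only: it computes the Fiedler vector with a dense eigensolver from NumPy, which is a substitute for, and far more expensive than, the multilevel scheme of [93]; the function names `spectral_bisection` and `rsb` are hypothetical, and the vertices are split at the median of the Fiedler vector.

```python
import numpy as np

def spectral_bisection(n_vertices, edges):
    """Split a graph in two halves using the Fiedler vector of its Laplacian."""
    lap = np.zeros((n_vertices, n_vertices))
    for i, j in edges:
        lap[i, i] += 1.0
        lap[j, j] += 1.0
        lap[i, j] -= 1.0
        lap[j, i] -= 1.0
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second smallest eigenvalue (the Fiedler vector).
    _, vecs = np.linalg.eigh(lap)
    order = np.argsort(vecs[:, 1])
    side = np.zeros(n_vertices, dtype=int)
    side[order[n_vertices // 2:]] = 1
    return side

def rsb(vertices, edges, levels):
    """Recursive bisection; returns dict vertex -> id in [0, 2**levels)."""
    if levels == 0 or len(vertices) < 2:
        return {v: 0 for v in vertices}
    index = {v: a for a, v in enumerate(vertices)}
    # Keep only the edges internal to the current vertex set.
    local = [(index[i], index[j]) for i, j in edges
             if i in index and j in index]
    side = spectral_bisection(len(vertices), local)
    result = {}
    for s in (0, 1):
        sub = [v for v in vertices if side[index[v]] == s]
        for v, p in rsb(sub, edges, levels - 1).items():
            result[v] = 2 * p + s
    return result
```

Applied to a path of eight vertices with two levels of recursion, the sketch produces four subdomains of two vertices each and cuts three edges.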

REMARK  8:  Throughout  this  section,  RSB 
designates  the  multilevel  Recursive  Spectral  Bi¬ 
section  algorithm.  In  particular,  all  perfor¬ 
mance  results  reported  for  RSB  applications 
correspond  to  the  fast  multilevel  scheme  de¬ 
scribed  in  [93],  and  release  2.1  of  the  code  as 
integrated  in  TOP/DOMDEC  [82]. 

Minimizing  interprocessor  communication 
costs  in  general,  and  the  number  of  interface 
points  in  a  mesh  partition  in  particular,  is  a 
reasonable  objective  to  “prioritize”  when  the 
target  parallel  application: 

a)  involves  communication  essentially  between 
neighboring  subdomains.  This  is  typi¬ 
cally  the  case  for  explicit  time-integration 
(or  pseudo  time-integration)  schemes,  and 
some  basic  iterative  solvers  such  as  the 
conjugate  gradient  or  Jacobi  precondi¬ 
tioned  conjugate  gradient  methods. 

b)  has  a  computational  complexity  that  can 
be  simply  related  to  mesh  entities  such  as, 
for  example,  nodes,  and/or  edges,  and/or 
elements,  and/or  cells.  In  that  case,  load 
balancing  can  be  reasonably  well  achieved 
by  requiring  that  each  subdomain  contain 
the  same  number  of  such  entities.  In  the 
event  of  heterogeneous  meshes,  a  weight¬ 
ing  factor  can  be  attributed  to  each  ba¬ 
sic  entity  and  the  number  of  mesh  entities 
per subdomain can be modified accordingly. Most importantly, load
balancing in that case does not significantly interfere with either
the minimum edge cut aspect of a graph partitioner or the practical
implementation of the corresponding mesh decomposer.


c)  uses  a  solution  methodology  whose  perfor¬ 
mance  is  insensitive  to  the  characteristics 
of  a  mesh  partition  such  as,  for  example, 
the  subdomain  aspect  ratio  or  the  subdo¬ 
main  interconnectivity. 

It  is  our  experience  that  when  conditions 
a),  b)  and  c)  are  met,  the  GR  and  RSB  al¬ 
gorithms  generate  excellent  mesh  partitions  for 
parallel  processing.  Therefore,  we  have  consis¬ 
tently  used  both  algorithms  for  the  subset  of 
our  parallel  applications  that  can  be  described 
by  the  above  a),  b)  and  c)  points. 

However,  many  important  parallel  appli¬ 
cations  do  not  fit  the  profile  implied  by  the 
a),  b)  and  c)  points.  For  example,  frontal 
sparse  solvers  which  are  popular  in  finite  ele¬ 
ments  and  structural  mechanics  [63-66,95]  re¬ 
quire  mesh  partitions  that  do  not  significantly 
inflate  the  operation  count  of  their  sequential 
counterparts.  This  particular  issue  relates  more 
to  the  subdomain  local  frontwidths  than  to  the 
subdomain  interface  sizes.  Moreover,  control¬ 
ling  the  load  balance  of  these  direct  solvers  is 
not  in  general  as  simple  as  distributing  equally 
some  basic  mesh  entities  across  the  desired  sub- 
domains. 

Optimal  domain  decomposition  based  it¬ 
erative  solvers  are  another  class  of  parallel  ap¬ 
plications  whose  scalability  is  not  governed  by 
interprocessor  communication  costs  only  [57]. 
These  solution  algorithms  are  interesting  on 
massively  parallel  processors  when  their  num¬ 
ber  of  iterations  for  convergence  does  not  grow 
(or  grows  weakly)  with  the  number  of  sub- 
domains.  Their  effectiveness  is  determined 
by  their  convergence  rate  and  not  by  their 
amount  of  communication.  In  particular,  op¬ 
timal  non-overlapping  domain  decomposition 
based  iterative  solvers  require  mesh  partitions 
that  have  as  perfect  subdomain  aspect  ratios 
(close  to  unity)  as  possible.  Sometimes,  fulfill¬ 
ing  this  requirement  leads  to  mesh  partitions 
with larger interfaces than otherwise possible.
This is well demonstrated below for the structural
High Speed Civil Transport wing finite element
model containing 3150 nodes. For this
problem,  the  32-subdomain  mesh  partition  gen¬ 
erated  by  the  RSB  algorithm  has  707  interface 
nodes  and  an  average  subdomain  aspect  ratio 
AR  =  0.39  (Fig.  22).  The  32-subdomain  mesh 
partition  generated  by  the  methodology  de¬ 
scribed  in  this  section  and  shown  in  Fig.  23  has 
808 interface nodes, but an average subdomain
aspect ratio AR = 0.62. When the FETI domain
decomposition based iterative solver presented
in Section 5 is applied to the structural
wing  problem,  it  converges  in  47  iterations  and 
11.93  seconds  when  using  the  RSB  mesh  parti¬ 
tion  on  a  32-processor  iPSC-860.  On  the  other 
hand,  it  converges  in  30  iterations  and  7.75  sec¬ 
onds  when  using  the  mesh  partition  with  larger 
interface  but  improved  subdomain  aspect  ratio 
[96].  Hence,  one  should  question  whether  the 
minimum  interface  size  is  not  after  all  an  over¬ 
emphasized  mesh  partitioning  criterion. 


Fig.  22.  32-subdomain  mesh  partition  for 
an  HSCT  wing  structural  model  (RSB) 


Fig.  23.  32-subdomain  mesh  partition 
(optimized  subdomain  aspect  ratio) 

8.3.  Nomenclature 

Throughout  this  section,  the  following  nomen¬ 
clature  is  used: 

$E$            set of edges of the dual graph of the mesh
$P$            partitioning vector: $P_i = k$ means that the mesh
               entity $i$ belongs to subdomain $k$
$C$            cost function to be optimized
LBF            load balance factor
$L$            computational load of a given application
$N_p$          number of processors
$N_s$          number of subdomains
$N_e$          number of elements in the mesh
$N_n$          number of nodes in the mesh
$N_i$          number of degrees of freedom in the model
$N^{k}$        number of some specific mesh entities in subdomain $k$
               (including its interface boundary)
$N_{me}$       number of macro elements in a contracted mesh
$N_I$          number of interface points of a mesh partition
$N^{best,k}$   number of mesh entities in subdomain $k$ that yields an
               optimal load balance
$d$            spatial dimension of the problem
$x_{ij}^{k}$   $i$-th coordinate of the $j$-th node in subdomain $k$
$X_{ik}$       $i$-th coordinate of the center of gravity of
               subdomain $k$

8.4. Two-Step Partitioning and Retrofitting

For all our computational mechanics parallel applications, we have
adopted the two-step mesh partitioning paradigm that was first
introduced in [97,98], then refined in [99], and which consists in

Step 1) generating an initial mesh decomposition via either a
suboptimal but fast partitioning algorithm, or an algorithm that is
known to produce mesh partitions that are reasonably well suited for
the target parallel application.

Step 2) formulating the application specific requirements as a cost
function C, and optimizing it by readjusting the initial subdomain
interfaces. This step can also be described as a retrofitting
procedure.

The Greedy algorithm is very fast because its complexity grows as
$O(N_e \times N_s)$. Moreover, it produces mesh partitions that are
reasonably well-suited for most parallel computational methods. Hence,
the GR algorithm is ideally suited for generating an initial
decomposition in Step 1.

In Step 2, a cost function representing the decomposition requirements
of the target parallel application must first be formulated. A sample
list of cost functions to optimize is given below:

Interface size: $C_1 = |\{(i,j) \in E \;/\; P_i \neq P_j\}|$. Here, the
size of the interface is defined as the number of edges in E whose
vertices belong to two different subdomains. This cost function may not
govern all parallel applications but is certainly helpful in all cases.

Load imbalance: $C_2 = \sum_{k=1}^{N_s} (N^{k} - N^{best,k})^2$. When a
parallel application has a computational complexity that can be simply
related to mesh entities such as, for example, nodes, and/or edges,
and/or elements, and/or cells, the computational load L can be easily
estimated and $N^{best,k}$ can be set prior to the decomposition to
$N^{best,k} = L/N_p$. Otherwise, $N^{best,k}$ is unknown a priori. It
can have a different value in every subdomain k, and is adaptively
evaluated by the optimization algorithm.

Subdomain aspect ratio:
$C_3 = \sum_{k=1}^{N_s} \left( \sum_{i=1}^{d} \sum_{j=1}^{N_n} (x_{ij}^{k} - X_{ik})^2 \right)$.
This cost function has been shown to play a pivotal role in the
convergence rate of optimal domain decomposition based preconditioned
conjugate gradient methods [96].

In practice, the performance of a parallel application is often
governed by several distinct factors. Therefore, one should consider in
general the following weighted cost function:

$$C = \sum_i \alpha_i C_i \qquad (152)$$


where $C_i$ is a cost function representing one specific issue (for
example, $C_i$ could be any one of the cost functions listed above)
and $\alpha_i$ is the weight attributed to that issue. In that case,
optimizing C corresponds to finding the best possible "compromise"
mesh partition. Unfortunately, we do not yet have an automatic
mechanism for selecting the weight coefficients $\alpha_i$. For this
task, we rely on our understanding of the focus parallel application,
and experience with the target parallel processor.
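
As an illustration, the interface-size and load-imbalance costs and their weighted combination can be evaluated as sketched below. The function names are hypothetical. Two forms are assumptions rather than statements of the source: the load imbalance is taken as a sum of squared deviations from the ideal subdomain size, and the load balance factor is taken as the ratio of the average subdomain load to the maximum one.

```python
def interface_size(edges, part):
    """C1: number of dual-graph edges whose end points lie in different subdomains."""
    return sum(1 for i, j in edges if part[i] != part[j])

def load_imbalance(counts, best):
    """C2 (assumed form): squared deviation of each subdomain count from its ideal value."""
    return sum((n - b) ** 2 for n, b in zip(counts, best))

def load_balance_factor(counts):
    """LBF (assumed form): average subdomain load divided by the maximum load."""
    return (sum(counts) / len(counts)) / max(counts)

def weighted_cost(costs, weights):
    """C: weighted sum of the individual cost functions."""
    return sum(a * c for a, c in zip(weights, costs))

# Example: a chain of four entities split into two subdomains of two.
edges = [(0, 1), (1, 2), (2, 3)]
part = [0, 0, 1, 1]
counts = [part.count(k) for k in range(2)]
c1 = interface_size(edges, part)
c2 = load_imbalance(counts, [2, 2])
total = weighted_cost([c1, c2], [0.5, 0.5])
```

With equal weights of 0.5, as used in the examples of this section, a perfectly balanced partition is ranked by its interface size alone.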

After  a  cost  function  is  formulated,  the  de¬ 
composition  is  optimized  by  readjusting  only 
the  subdomain  interfaces.  More  specifically, 
only  the  mesh  entities  that  are  attached  to  the 
interface are examined for possible exchange between
the subdomains. Therefore, the computational
complexity of the optimization process
is proportional to the interface size and not to
the  number  of  elements  in  the  mesh.  In  col¬ 
laboration  with  the  Universite  Catholique  de 
Louvain,  we  have  implemented  three  different 
schemes  for  optimizing  a  given  cost  function. 

Simulated  Annealing  (SA)  [100].  This  al¬ 
gorithm  uses  a  monotonically  decreasing  “tem¬ 
perature”  as  control  variable  for  the  outer  it¬ 
erations.  For  a  fixed  temperature,  a  number 
of  mesh  entities  are  proposed  for  transfer  to  a 
neighboring  subdomain  —  in  the  sequel,  we  re¬ 
fer  to  this  step  as  a  “move”.  The  acceptance 
of  a  move  is  dictated  by  a  probabilistic  decision 
which  depends  on  the  difference  in  cost  between 
making  the  move  or  ignoring  it.  The  optimiza¬ 
tion  process  ends  when  the  temperature  is  suffi¬ 
ciently  low  and  no  further  moves  are  accepted. 
In  the  inner  loop,  moves  are  chosen  randomly. 
The  probability  of  acceptance  of  bad  moves  de¬ 
creases  with  temperature. 
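
The acceptance rule and temperature loop can be sketched as follows. The text above only states that acceptance is probabilistic and depends on the cost difference; the Metropolis rule exp(-delta/T), the geometric cooling schedule, and all names and default values below are assumptions chosen for illustration. For simplicity, the loop stops on a temperature threshold only.

```python
import math
import random

def accept_move(delta_cost, temperature, rng=random):
    """Metropolis rule (assumed): always accept improvements, and accept a
    cost increase with probability exp(-delta_cost / temperature)."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)

def anneal(cost, propose, state, t0=1.0, cooling=0.9, t_min=1e-3,
           moves_per_t=100, seed=0):
    """Outer loop on a decreasing temperature, inner loop on random moves."""
    rng = random.Random(seed)
    current = cost(state)
    t = t0
    while t > t_min:
        for _ in range(moves_per_t):
            candidate = propose(state, rng)  # e.g. move one interface entity
            delta = cost(candidate) - current
            if accept_move(delta, t, rng):
                state, current = candidate, current + delta
        t *= cooling  # bad moves become less likely as t decreases
    return state, current
```

In a mesh partitioning setting, `propose` would transfer one interface entity to a neighboring subdomain and `cost` would evaluate the chosen cost function C.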

Tabu  Search  (TS)  [101].  This  scheme 
stores  in  a  tabu  list  a  specified  number  of  re¬ 
cently  accepted  moves.  In  the  inner  loop,  sev¬ 
eral  moves  outside  the  tabu  list  are  proposed, 
and  the  move  with  the  highest  positive  or  nega¬ 
tive  gain  is  accepted.  In  the  outer  loop,  the  last 
accepted  move  replaces  the  oldest  move  in  the 
tabu  list.  Therefore,  if  this  algorithm  escapes  a 
local  minimum,  it  cannot  use  the  same  path  in 
the  solution  space  to  reach  this  minimum  again. 
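
A compact sketch of this scheme is shown below. The names `tabu_search` and `neighbors`, the list length, and the iteration count are hypothetical; a move is identified here by a label returned by `neighbors`, for instance a pair (entity, destination subdomain).

```python
from collections import deque

def tabu_search(cost, neighbors, state, tabu_size=5, n_iter=50):
    """Accept the best non-tabu move at each step, even if it raises the cost.

    neighbors(state) must return a list of (move_label, new_state) pairs.
    """
    tabu = deque(maxlen=tabu_size)  # appending drops the oldest move
    best, best_cost = state, cost(state)
    for _ in range(n_iter):
        candidates = [(m, s) for m, s in neighbors(state) if m not in tabu]
        if not candidates:
            break
        # Highest gain, positive or negative: the lowest resulting cost.
        move, state = min(candidates, key=lambda ms: cost(ms[1]))
        tabu.append(move)
        current_cost = cost(state)
        if current_cost < best_cost:
            best, best_cost = state, current_cost
    return best, best_cost
```

Because the best state seen so far is stored separately, the search can leave a local minimum without losing it.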

Stochastic  Evolution  (SE)  [102].  The  main 
difference  between  this  algorithm  and  Simu¬ 
lated  Annealing  is  in  the  evolution  of  the  con¬ 
trol  variable  and  the  selection  of  the  moves.  At 
each  outer  iteration,  all  interface  elements  are 
proposed  for  a  move  in  a  predefined  order.  The 
temperature  decreases  rapidly,  thereby  decreas¬ 
ing  the  probability  of  accepting  bad  moves,  un¬ 
til  the  solution  reaches  a  local  minimum  of  the 
cost  function.  At  this  point,  the  temperature  is 
reset  to  its  initial  value.  In  general,  this  algo¬ 
rithm  behaves  as  a  series  of  fast  SA  processes 
where  the  solution  jumps  from  one  local  mini¬ 
mum  to  another. 


A quality/speed trade-off can be applied to each
of  the  above  optimization  schemes  by  “tuning” 
a  few  control  parameters  [98]. 

There  is  at  least  one  compelling  reason  for 
having  more  than  one  optimization  algorithm 
at  hand.  In  some  cases,  the  initial  mesh  par¬ 
tition  generated  in  Step  1  can  get  entrapped 
in  a  local  minimum  at  the  first  step  of  an  op¬ 
timization  scheme,  in  which  case  Step  2  does 
not  improve  the  original  decomposition.  One 
can hope that switching to another optimization
algorithm pulls the solution out of that local
minimum. Every time we have encountered
this  problem  for  SA,  we  were  able  to  resolve  it 
by  switching  to  TS. 

In  order  to  illustrate  the  two-step  method¬ 
ology  described  above  and  highlight  its  po¬ 
tential,  we  consider  the  partitioning  of  two 
three-dimensional  fluid  dynamics  unstructured 
meshes  into  64  and  128  subdomains.  The  first 
mesh, FALC, is designed for the simulation of
external  Euler  flows  around  a  Falcon  aircraft.  It 
contains  439272  tetrahedra  and  77279  vertices. 
The  second  mesh,  MUFF,  is  designed  for  the 
simulation  of  internal  viscous  flows  inside  a  car 
muffler  (Fig.  24).  It  contains  237963  tetrahe¬ 
dra  and  43592  vertices.  Here,  we  assume  that 
the  objective  is  to  generate  mesh  partitions  with 
equal  number  of  tetrahedra  and  minimum  num¬ 
ber  of  interface  points.  Hence,  the  load  balance 
factor  can  be  written  in  this  case  as  follows 

$$\mathrm{LBF} = \frac{N_e / N_s}{\max_k N_e^{k}}$$

where $N_e^{k}$ is the number of tetrahedra in subdomain $k$. More
complex objectives are discussed in Section 8.6.

Fig. 24. Three-dimensional discretization of the flow domain inside a
car muffler

First, the GR and multilevel RSB algorithms are used to partition the
FALC mesh into 64 subdomains, and the MUFF mesh into 128 subdomains.
Following the recommendation given in [93], the computational size of
the lowest level is set to 300 for the RSB scheme. Next, the two-step
methodology is applied to generate similar mesh partitions. The GR
algorithm is selected for Step 1, and the SA optimization scheme for
Step 2. For both meshes, the cost function is defined as
$C = 0.5 \times C_1 + 0.5 \times C_2$, and the parameters $N^{k}$ and
$N^{best,k}$ are set to $N^{k} = N_e^{k}$ and $N^{best,k} = N_e/N_p$.
The characteristics of the resulting mesh partitions are summarized in
Tables 1 and 2. All computations are performed on a Crimson Silicon
Graphics workstation.

The results reported in Table 1 show that for the FALC mesh, RSB
outperforms GR for the imposed objective. The mesh partition produced
by RSB has 18.40% fewer interface points than that delivered by GR, but
costs 48.07 times more CPU time to generate. On the other hand, the
two-step partitioning methodology with GR as an initial decomposer
outperforms RSB for the same objective. The mesh partition generated by
GR and optimized by SA has 9.83% fewer interface points than that
delivered by RSB and costs 1.97 times less CPU time to produce. For the
MUFF mesh, the results reported in Table 2 show that GR outperforms RSB
for the imposed objective. More specifically, GR produces a mesh
partition with 2.53% fewer interface points than RSB does, and does so
115.40 times faster. The two-step partitioning methodology with GR as
an initial decomposer outperforms both RSB and GR, is significantly
cheaper than RSB, but is also significantly more expensive than GR.

Table 1

Partitioning of the FALC mesh: $N_e$ = 439272, $N_s$ = 64
$C = 0.5 \times C_1 + 0.5 \times C_2$
SGI/Crimson

Scheme   Optimizer   N_I      LBF     CPU Step 1   CPU Step 2   CPU Total
RSB      None        21139    0.999   707.10 s       0.00 s     707.10 s
GR       None        25906    0.999    14.71 s       0.00 s      14.71 s
GR       SA          19060    0.999    14.71 s     342.76 s     357.47 s

Table 2

Partitioning of the MUFF mesh: $N_e$ = 237963, $N_s$ = 128
$C = 0.5 \times C_1 + 0.5 \times C_2$
SGI/Crimson

Scheme   Optimizer   N_I      LBF     CPU Step 1   CPU Step 2   CPU Total
RSB      None        17810    0.999   791.69 s       0.00 s     791.69 s
GR       None        17358    0.999     6.86 s       0.00 s       6.86 s
GR       SA          14934    0.996     6.86 s     551.72 s     558.58 s

For the above two examples, we have used GR as an initial decomposer in order to keep the total partitioning costs as low as possible. However, if preprocessing costs are not an issue, RSB can also be used in Step 1. For the MUFF mesh, the two-step method with RSB as an initial decomposer generates a mesh partition with 14252 interface points and consumes 1212.49 s. CPU (791.69 s. (Step 1) + 420.80 s. (Step 2)). This particular example shows that when an initial mesh partition is slightly better than another one, its optimized version is not necessarily better than the optimized version of that other one.

Also  note  that  for  the  above  problems,  all 
algorithms  including  the  two-step  methodology 
deliver  mesh  partitions  with  perfect  load  bal¬ 
ance  factors. 

We  are  particularly  interested  in  fast  and 
good  partitioning  algorithms  because  we  would 
like  to  be  able  to  inspect  —  possibly  interac¬ 
tively  —  a  few  mesh  decompositions  before  se¬ 
lecting  one  for  a  target  parallel  application.  The 
examples  reported  above  highlight  the  poten¬ 
tial  of  the  two-step  methodology  for  generating 
excellent  mesh  partitions.  However,  the  opti¬ 
mization  step  is  not  as  fast  as  we  would  like  it 
to  be.  Next,  we  present  a  contraction  proce¬ 
dure  for  speeding  up  the  optimization  process 
in  Step  2. 


8.5.  An  Efficient  Contraction  Procedure 

The  idea  of  contracting  a  mesh  before  parti¬ 
tioning  it  is  not  new.  Apparently,  it  was  first 
proposed  in  [93]  for  reducing  the  costs  of  the 
RSB  partitioning  scheme,  and  in  [103]  for  stor¬ 
age  optimization  purposes.  The  contraction  ap¬ 
proach  presented  in  [93]  is  based  on  the  con¬ 
cept  of  maximal  independent  sets.  The  con¬ 
traction  approach  proposed  herein  is  based  on 
the  Greedy  algorithm  and  our  experience  with 
this  heuristic.  Our  main  objective  is  to  speed 
up  the  optimization  process  in  Step  2  of  the 
partitioning  methodology.  Our  main  strategy 
goes  as  follows. 

First, the mesh is recursively coarsened using an $O(N_e)$ Greedy-based contraction procedure until its size reaches a user-specified value, say $N_{me} = 5000$ macro-elements. An initial decomposition is performed on the coarse mesh, preferably using a fast mesh partitioning algorithm. This decomposition is followed by a few smoothing iterations using one of the three optimization schemes introduced in Section 3. Next, the obtained coarse partition is mapped onto the original and finer mesh, and another optimization is performed on the fine level. When more than one level of contraction is needed to reach the specified number of macro-elements $N_{me}$, coarse-to-fine mapping and optimization are performed at every intermediate level.

More specifically, the contraction step is implemented as follows. Given a starting element, a fixed-size cluster is constructed by agglomerating neighboring elements in a recursive manner. This cluster defines a macro-element in the contracted mesh. At the beginning, the starting element is selected among the peripheral elements. Later, it is selected among those elements which neighbor existing clusters. The contraction ends when all elements are attributed to a cluster. In practice, we have found that 5 elements is a good choice for the size of a cluster. However, fewer or more elements can sometimes define a cluster for connectivity purposes.
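
The contraction step and the coarse-to-fine mapping described above can be sketched as follows. This is only an illustration under stated assumptions, not the TOP/DOMDEC implementation: the mesh is assumed to be given as an element adjacency dictionary, the "peripheral" starting element is approximated by the element with the fewest neighbors, clusters are grown breadth-first, and the function names `greedy_contract` and `map_to_fine` are hypothetical.

```python
from collections import deque


def greedy_contract(adjacency, cluster_size=5):
    """Group elements into clusters (macro-elements) of about `cluster_size`.

    adjacency: dict mapping element id -> list of neighboring element ids.
    Returns a dict mapping element id -> cluster id.
    """
    cluster_of = {}
    unassigned = set(adjacency)
    # Assumption: an element with the fewest neighbors lies on the periphery.
    start = min(adjacency, key=lambda e: len(adjacency[e]))
    frontier = deque([start])  # candidate seeds that touch existing clusters
    next_id = 0
    while unassigned:
        # Prefer a seed that neighbors an already built cluster.
        seed = None
        while frontier:
            candidate = frontier.popleft()
            if candidate in unassigned:
                seed = candidate
                break
        if seed is None:  # disconnected piece: restart from any free element
            seed = next(iter(unassigned))
        # Grow the cluster breadth-first until the target size is reached.
        members = []
        grow = deque([seed])
        while grow and len(members) < cluster_size:
            e = grow.popleft()
            if e not in unassigned:
                continue
            unassigned.discard(e)
            cluster_of[e] = next_id
            members.append(e)
            grow.extend(n for n in adjacency[e] if n in unassigned)
        # Free neighbors of the new cluster become future seeds.
        for e in members:
            frontier.extend(n for n in adjacency[e] if n in unassigned)
        next_id += 1
    return cluster_of


def map_to_fine(coarse_partition, cluster_of):
    """Give every fine element the subdomain of its macro-element."""
    return {e: coarse_partition[c] for e, c in cluster_of.items()}
```

A cluster may end up smaller than `cluster_size` when its free neighbors run out, which mirrors the remark that fewer elements can sometimes define a cluster. Each element is assigned exactly once, and the work per element is bounded by its number of neighbors, so the pass stays linear in the number of elements for meshes of bounded connectivity.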

The  impact  of  the  contraction  procedure 
described  above  on  the  two-step  partitioning 
methodology  is  highlighted  in  Tables  3  and  4 
for  the  FALC  and  MUFF  meshes,  respectively. 


Table 3

Partitioning of the FALC mesh: Ne = 439272 - Ns = 64
C = 0.5 x C1 + 0.5 x C2
Effects of the contraction procedure
SGI/Crimson

Scheme        Optimizer     N_I      LBF      CPU Contr. (s)   CPU Step 1 (s)   CPU Step 2 (s)   CPU Total (s)
RSB           None          21139    0.999    0.00             707.10           0.00             707.10
GR            None          25906    0.999    0.00             14.71            0.00             14.71
GR            SA            19060    0.999    0.00             14.71            342.76           357.47
Contr. + GR   Contr. + SA   16160    0.999    6.65             0.08             38.36            45.09

Table 4

Partitioning of the MUFF mesh: Ne = 237963 - Ns = 128
C = 0.5 x C1 + 0.5 x C2
Effects of the contraction procedure
SGI/Crimson

Scheme        Optimizer     N_I      LBF      CPU Contr. (s)   CPU Step 1 (s)   CPU Step 2 (s)   CPU Total (s)
RSB           None          17810    0.999    0.00             791.69           0.00             791.69
GR            None          17358    0.999    0.00             6.86             0.00             6.86
GR            SA            14934    0.996    0.00             6.86             551.72           558.58
Contr. + GR   Contr. + SA   12792    0.999    3.02             0.12             143.70           146.84

For the FALC mesh and 64 subdomains, the contraction procedure is shown to reduce the cost of Step 2 by almost an order of magnitude (from 342.76 s. to 38.36 s.). In that case, the two-step partitioning method with GR as an initial decomposer produces a mesh partition with 23.55% fewer interface nodes than that generated by RSB, and is 15.68 times faster than RSB.

For the MUFF mesh and 128 subdomains, the two-step partitioning method with graph contraction and GR as an initial decomposer produces a mesh partition that has 28.17% fewer interface nodes than the RSB partition, and is 5.39 times faster than RSB.
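
The percentages and speed ratios quoted for Tables 3 and 4 follow directly from the tabulated interface sizes and total CPU times. The short check below recomputes them; the helper names are hypothetical, and the data are copied from the RSB and "Contr. + GR / Contr. + SA" rows of the two tables.

```python
def interface_reduction_percent(n_ref, n_new):
    """Relative reduction of the interface size, in percent of the reference."""
    return 100.0 * (n_ref - n_new) / n_ref


def speedup(t_ref, t_new):
    """How many times faster the new scheme is than the reference."""
    return t_ref / t_new


# (interface points N_I, total CPU seconds) from Tables 3 and 4
results = {
    "FALC": {"RSB": (21139, 707.10), "two-step": (16160, 45.09)},
    "MUFF": {"RSB": (17810, 791.69), "two-step": (12792, 146.84)},
}

for mesh, rows in results.items():
    n_rsb, t_rsb = rows["RSB"]
    n_two, t_two = rows["two-step"]
    print(f"{mesh}: {interface_reduction_percent(n_rsb, n_two):.2f}% fewer "
          f"interface points, {speedup(t_rsb, t_two):.2f} times faster")
```

Note that the reduction is measured relative to the RSB partition, so a ratio of interface sizes and a percentage reduction are not interchangeable figures.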

The  performance  results  reported  in  Ta¬ 
bles  3  and  4  also  show  that  the  proposed  con¬ 
traction  procedure  not  only  speeds  up  the  two- 
step  partitioning  method,  but  also  results  in 
better  mesh  decompositions.  Indeed,  the  con¬ 
tracted  mesh  represents  the  structure  of  the 
original  grid,  and  the  optimization  of  its  decom¬ 
position  tends  to  improve  the  global  structure 
of  the  desired  mesh  partition  by  moving  sev¬ 
eral  elements  simultaneously.  When  the  mesh 
is  not  contracted,  the  global  structure  of  the 
mesh  partition  remains  identical  to  that  of  the 
initial  decomposition  because  the  probability  of 
transferring  large  amounts  of  elements  between 
the  initial  subdomains  is  usually  low. 

As mentioned earlier, a quality/speed trade-off can be applied to each of the three optimization schemes by “tuning” some of their control parameters. An example of such a trade-off is illustrated in Table 5 for the FALC mesh and various numbers of subdomains. From the results reported in this table, it follows that, for the cost function $C = 0.5 \times C_1 + 0.5 \times C_2$, the two-step partitioning method with contraction can generate even better mesh partitions when the optimization algorithm is allowed to run longer in Step 2. Note that even in that case, the two-step method is still significantly cheaper than the multilevel RSB algorithm. For example, it can generate a 64-subdomain partition for the FALC mesh with only 14 613 interface points in 161.04 seconds, whereas the RSB scheme consumes 707.10 seconds to generate a 64-subdomain mesh partition with 21 139 interface points (see Table 3). This amounts to a mesh partition with about 31% fewer interface points at less than a quarter of the price. The performance results summarized in Table 5 also show that the complexity of the two-step partitioning method with contraction is sublinear in the number of subdomains.
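
One way to make the sublinearity claim concrete is to fit a power law CPU ~ Ns^p to the Step 2 timings of Table 5 and check that p < 1. The sketch below does this with an ordinary least-squares fit in log-log space; the fitting procedure is an illustration added here and is not part of the original study.

```python
import math

# (number of subdomains, CPU Step 2 in seconds, QUALITY setting) from Table 5
quality_timings = [(2, 12.80), (4, 32.90), (8, 51.19), (16, 94.09),
                   (32, 124.40), (64, 161.04), (128, 207.30)]


def loglog_slope(points):
    """Least-squares slope of log(t) versus log(n), i.e. the exponent p."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


exponent = loglog_slope(quality_timings)
print(f"CPU(Step 2) grows roughly like Ns^{exponent:.2f}")
```

An exponent below one means that doubling the number of subdomains less than doubles the optimization cost, which is the sense in which the complexity is sublinear.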


Table 5

Partitioning of the FALC mesh: Ne = 439272 - Ns = 64
C = 0.5 x C1 + 0.5 x C2
Two-step partitioning method with contraction
Initial decomposer = GR - optimization scheme = SA
Computational complexity of Step 2 - quality/speed trade-offs
SGI/Crimson

Ns     CPU Step 1 (s)   N_I (QUALITY)   N_I (SPEED)   CPU Step 2 (s) (QUALITY)   CPU Step 2 (s) (SPEED)
2      0.02             1834            2371          12.80                      6.19
4      0.03             4291            4853          32.90                      12.72
8      0.03             5997            6903          51.19                      17.40
16     0.04             8240            9413          94.09                      27.01
32     0.06             11414           12682         124.40                     29.93
64     0.08             14613           16160         161.04                     38.44
128    0.16             18740           20520         207.30                     47.78

In the remainder of this section, we use exclusively GR for all initial decompositions. We show that in all cases, the two-step partitioning methodology with the contraction procedure described herein is a cheaper and better alternative to the multilevel RSB algorithm.

8.6.  Highlights 

The  two-step  decomposition  methodology  and 
the  contraction  procedure  described  in  this  sec¬ 
tion  are  available  in  the  TOP/DOMDEC  [82] 
interactive  software  package  for  mesh  partition¬ 
ing  and  parallel  processing.  Here,  we  illustrate 
these  two  methodologies  with  examples  from 
computational  structural  mechanics  and  fluid 
dynamics,  and  highlight  their  impact  on  the 
parallel  solution  time  of  these  problems  on  an 
iPSC-860  multiprocessor  and  a  Convex  Meta 
Series  system. 

Mesh  partitioning  algorithms  are  often 
evaluated  and/or  benchmarked  by  simply  as¬ 
sessing  and/or  comparing  the  characteristics  of 
the  mesh  partitions  they  generate  (interface 
size,  theoretical  load  balance  factors,  ...).  Such 
an  approach  is  at  best  incomplete.  The  ulti¬ 
mate  goal  of  a  mesh  partitioning  algorithm  is 
to  reduce,  if  possible,  the  parallel  CPU  time 
of  the  target  parallel  application.  Hence,  mesh 
partitioning  algorithms  should  be  benchmarked 
by  comparing  their  impact  on  problem  solving. 
Here,  we  consider  three  classes  of  applications: 
the  solution  of  a  set  of  semi-discrete  differen¬ 
tial  equations  via  an  explicit  time-integration 
scheme,  the  solution  of  a  system  of  sparse  linear 
equations  via  a  domain  decomposition  based  it¬ 
erative  algorithm,  and  the  solution  of  a  system 
of  sparse  linear  equations  via  a  frontal  method. 

8.6.1.  Explicit  Time-Marching 

First, we consider a stress wave propagation problem in a line-pinched plate with a circular hole. The plate is discretized using 47680 4-node shell elements and 48235 nodes (Fig. 25). The corresponding number of equations is 233939. For this problem, the semi-discrete finite element equations of dynamic equilibrium are time-integrated using the explicit central difference scheme. Four different mesh partitions are generated for parallel computations on a 64-processor iPSC-860 system. The characteristics of these decompositions are summarized in Table 6 where $N_I^{min,k}$, $N_I^{average,k}$, $N_I^{max,k}$, and $N_I$ denote respectively the minimum, average, and maximum number of interface nodes per subdomain, and the total number of interface nodes in the mesh partition. Given that the parallel performance of the central difference scheme (and of most explicit time-integration algorithms) is governed by load balancing and communication costs, the cost function $C = 0.5 \times C_1 + 0.5 \times C_2$ and the parameters [...] and [...] $= \sum_{k=1}^{k=N_s} [\ldots]$ are used for this application.
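
For readers unfamiliar with the explicit central difference scheme mentioned above, a minimal sketch is given below. It assumes a lumped (diagonal) mass matrix and a user-supplied internal force routine; the function name `central_difference` and its arguments are hypothetical, and the code is a serial illustration rather than the parallel finite element solver used for the reported timings.

```python
def central_difference(mass, internal_force, u0, v0, dt, n_steps, load=None):
    """Integrate M u'' + f_int(u) = f_ext with the explicit central difference scheme.

    mass: list of lumped nodal masses (diagonal of M).
    internal_force: function u -> list of internal forces (K u for linear problems).
    u0, v0: initial displacements and velocities.
    Returns the displacements after n_steps time-steps of size dt.
    """
    n = len(u0)
    f_ext = load if load is not None else [0.0] * n

    def acceleration(u):
        f_int = internal_force(u)
        return [(f_ext[i] - f_int[i]) / mass[i] for i in range(n)]

    a0 = acceleration(u0)
    # Fictitious displacement at t = -dt, needed to start the recurrence.
    u_prev = [u0[i] - dt * v0[i] + 0.5 * dt * dt * a0[i] for i in range(n)]
    u = list(u0)
    for _ in range(n_steps):
        a = acceleration(u)
        # u_{n+1} = 2 u_n - u_{n-1} + dt^2 * M^{-1} (f_ext - f_int(u_n))
        u_next = [2.0 * u[i] - u_prev[i] + dt * dt * a[i] for i in range(n)]
        u_prev, u = u, u_next
    return u
```

Each step needs only one internal force evaluation and no system solve. In a subdomain-based parallel implementation, the internal forces are assembled locally and only the contributions at interface nodes have to be exchanged, which is why load balance and interface size are the natural quantities to weigh in the cost function for this class of applications.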

For the above problem and 64 subdomains, the interface size of the mesh partition generated by the RSB scheme is smaller than that of the mesh partition produced by the GR algorithm by a factor of 1.14 (about 12%). On the other hand, the two-step partitioning methodology without contraction reduces the interface size of the GR decomposition by a factor of 1.29 (about 22%), and with contraction it reduces it by a factor of 1.36 (about 26%) (see Table 6).

Table 6

Partitioning of the plate mesh: Ne = 47680 - Nn = 48235 - Nd = 233939 - Ns = 64
C = 0.5 x C1 + 0.5 x C2

Scheme   Optimizer   Contraction   N_I^{min,k}   N_I^{average,k}   N_I^{max,k}   N_I     N_I^{GR}/N_I
RSB      None        No            74            108               159           3433    1.14
GR       None        No            56            124               286           3912    1.00
GR       SA          No            53            101               149           3039    1.29
GR       SA          Yes           52            98                144           2876    1.36

Fig.  25.  Finite  element  discretization  of 
a  plate  with  a  circular  hole 


The performance results on a 64-processor iPSC-860 system of the transient analysis of the plate problem are reported in Table 7 for 2000 integration time-steps, and the four generated 64-subdomain mesh partitions. Throughout the remainder of this section, $T_{comm}$ and $T_{sol}$ denote respectively the communication time and the total solution time for the target parallel application.


Table 7

Explicit central difference
Plate mesh: Ne = 47680 - Nn = 48235 - Nd = 233939 - Ns = 64
Solution time for 2000 time-steps on an iPSC-860/64

Scheme   Optimizer   Contraction   T_comm (s)   T_sol (s)   N_I^{GR}/N_I   T_comm^{GR}/T_comm   T_sol^{GR}/T_sol
RSB      None        No            115.28       706.91      1.14           1.20                 1.03
GR       None        No            138.34       728.12      1.00           1.00                 1.00
GR       SA          No            101.72       693.45      1.29           1.36                 1.05
GR       SA          Yes           98.81        693.41      1.36           1.40                 1.05

Clearly, the results reported in Table 7 show that the communication costs of the explicit central difference time-integration algorithm are directly related to the number of interface nodes (for this problem, it turns out that all generated mesh partitions have a similar average number of neighboring subdomains). However, these results also indicate that for this class of parallel applications, there is little to gain by searching for the “perfect” mesh partition with the least number of interface nodes. For example, the two-step mesh decomposition algorithm with contraction reduces the interface size and communication costs of the GR partition by factors equal to 1.36 and 1.40, respectively, but improves the total CPU time corresponding to the GR partition by only 5%. Hence, it would seem that the applications for which one has legitimate reasons to prioritize the minimization of the interface nodes are the least sensitive to the size of the subdomain interfaces. Of course, such a statement assumes that the given parallel processor is reasonably fast in communication, and that the size of the problem to be solved justifies the chosen number of subdomains or processors.
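
The small overall gain can be anticipated with an Amdahl-type estimate: if only the communication part of the run time shrinks, the total speedup is bounded by the share of communication in the total. The sketch below applies this to the GR row of Table 7; it assumes that the computation time is unaffected by the choice of partition, which is a simplification introduced here, and the function name is hypothetical.

```python
def predicted_total_speedup(t_sol, t_comm, comm_gain):
    """Total speedup when only the communication time is divided by comm_gain."""
    t_comp = t_sol - t_comm  # assumed unchanged by the partition
    return t_sol / (t_comp + t_comm / comm_gain)


# GR partition in Table 7: T_sol = 728.12 s, T_comm = 138.34 s.
# The two-step method with contraction cuts T_comm by a factor of 1.40.
comm_share = 138.34 / 728.12
estimate = predicted_total_speedup(728.12, 138.34, 1.40)
print(f"communication share: {comm_share:.0%}, predicted speedup: {estimate:.3f}")
```

With communication accounting for roughly a fifth of the solution time, even a 40% faster communication phase translates into a total gain of only a few percent, in line with the factor of 1.05 reported in Table 7.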

One could argue that the above conclusions hold only for two-dimensional problems where the subdomain interfaces are topologically one-dimensional, but not necessarily for three-dimensional problems where the subdomain interfaces are topologically two-dimensional, and the average number of neighbors for a given subdomain is higher. For this reason, we investigate next the parallel performance of the explicit central difference algorithm applied to the evaluation of the linear transient response of a three-dimensional engine nozzle subjected to a sudden pressure burst. The nozzle is discretized into 12800 8-node brick elements, 15579 nodes and 46701 active degrees of freedom (Fig. 26). Four different mesh partitions are generated for parallel computations on a 64-processor iPSC-860 system. The characteristics of these decompositions are summarized in Table 8. As for the previous example, the cost function is set to $C = 0.5 \times C_1 + 0.5 \times C_2$, [...] is set to [...], and [...] $= \sum_{k=1}^{k=N_s} [\ldots] / N_p$ is adopted.


Fig.  26.  Three-dimensional  finite  element 
discretization  of  a  nozzle 


Table 8

Partitioning of the nozzle mesh: Ne = 12800 - Nn = 15579 - Nd = 46701 - Ns = 64
C = 0.5 x C1 + 0.5 x C2

Scheme   Optimizer   Contraction   N_I^{min,k}   N_I^{average,k}   N_I^{max,k}   N_I     N_I^{GR}/N_I
RSB      None        No            116           185               272           5401    1.12
GR       None        No            129           212               316           6068    -
GR       TS          No            134           191               259           5494    -
GR       TS          Yes           120           177               220           5079    1.19

For the nozzle problem and 64 subdomains, the mesh partition generated by the RSB scheme has 1.12 times fewer interface nodes than that produced by the GR algorithm. The two-step partitioning methodology with contraction reduces the interface size of the GR decomposition by a factor equal to 1.19. Note that reducing the total number of interface nodes also seems to improve the interface load balancing factor $ILBF = N_I^{average,k} / N_I^{max,k}$. For example, ILBF = 0.67 only for the 64-subdomain mesh partition generated by the GR algorithm, while ILBF = 0.80 for that produced by the two-step decomposition methodology.
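
The two ILBF values quoted above can be recovered from Table 8 if ILBF is taken to be the ratio of the average to the maximum number of interface nodes per subdomain. That definition is an assumption made here because it reproduces both quoted figures; the snippet evaluates it for all four partitions.

```python
def ilbf(n_average, n_max):
    """Interface load balance factor: average over maximum interface nodes per subdomain."""
    return n_average / n_max


# (average, maximum) interface nodes per subdomain, from Table 8
table8 = {
    "RSB": (185, 272),
    "GR": (212, 316),
    "GR + TS": (191, 259),
    "GR + TS + contraction": (177, 220),
}

for scheme, (n_avg, n_max) in table8.items():
    print(f"{scheme:>22}: ILBF = {ilbf(n_avg, n_max):.2f}")
```

A value of 1 would mean that every subdomain carries the same number of interface nodes, so that no processor waits on a neighbor with a much larger interface.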

Table  9  reports  the  CPU  time  on  a  64- 
processor  iPSC-860  system  of  a  2000  time-step 
transient  analysis  of  the  engine  nozzle  using  the 
various  64-subdomain  mesh  partitions. 


Table 9

Explicit central difference
Nozzle mesh: Ne = 12800 - Nn = 15579 - Nd = 46701 - Ns = 64
Solution time for 2000 time-steps on an iPSC-860/64

Scheme   Optimizer   Contraction   T_comm (s)   T_sol (s)
RSB      None        No            136.40       338.00
GR       None        No            149.08       362.00
GR       TS          No            139.46       346.00
GR       TS          Yes           129.63       335.00

Before  commenting  on  the  performance  results 
summarized  in  Table  9,  it  is  worthwhile  noting 
that  the  iPSC-860  computer  used  for  this  appli¬ 
cation  has  only  8  Mbytes  of  memory  per  proces¬ 
sor.  The  smallest  number  of  processors  on  this 
machine  that  is  a  power  of  two  and  can  meet  the 
storage  requirements  of  this  three-dimensional 
dynamics  application  is  64.  From  Table  8,  it 
follows  that  32%  to  39%  of  the  nodes  of  a  64- 
subdomain  mesh  partition  of  the  nozzle  mesh 
are  interface  nodes.  Hence,  the  hardware  con¬ 
figuration  of  this  iPSC-860  and  the  memory  re¬ 
quirements  of  the  nozzle  dynamics  problem  are 
such  that  the  computational  and  communica¬ 
tion  requirements  of  this  target  parallel  appli¬ 
cation  are  not  particularly  well  balanced.  This 
is  reflected  in  the  performance  results  summa¬ 
rized  in  Table  9  which  show  that  38%  to  41% 
of  the  total  CPU  time  is  spent  in  communi¬ 
cation.  To  some  extent,  this  situation  is  typi¬ 
cal  of  three-dimensional  finite  element  problems 


solved  on  small  memory  massively  parallel  pro¬ 
cessors.  In  Table  9,  it  is  shown  that  RSB  im¬ 
proves  the  communication  time  over  GR  by  a 
factor  equal  to  1.09,  and  the  two-step  partition¬ 
ing  methodology  with  contraction  improves  the 
communication  time  over  GR  by  a  factor  equal 
to  1.15.  These  factors  are  consistent  with  those 
describing  the  reduction  of  the  number  of  in¬ 
terface  nodes.  However,  for  the  enhanced  mesh 
partitions,  the  total  CPU  time  is  only  7%  to 
8%  better  than  that  corresponding  to  the  GR 
partition,  which  is  also  consistent  with  the  dis¬ 
tribution  of  the  total  simulation  time  between 
computation  and  communication. 

In summary, minimizing the number of interface nodes of a mesh partition does improve the total CPU time of this class of parallel applications, but not by impressive factors. Stated differently, unless communication costs are overwhelming (in which case parallel processing is not necessarily attractive), any reasonable mesh partition is suitable for this type of parallel application. This fact is rarely recognized in the parallel processing literature.

8.6.2. Domain Decomposition Based Iterative Solvers

Here, we focus on the solution of the system of equations arising from the finite element static analysis of an elastic bearing under a distributed surface load. The finite element model of this three-dimensional structure contains 9600 8-node brick elements and 33075 degrees of freedom (Fig. 27). The optimal domain decomposition based FETI iterative solver (see Section 5) is used for parallel computations on a 64-processor iPSC-860 system. Three 64-subdomain mesh partitions are generated using RSB, GR, and the two-step mesh partitioning method with $C = 0.5 \times C_2 + 0.5 \times C_3$, [...] = [...] and [...] $= N_e/N_p$. The characteristics of these mesh partitions and the corresponding performance results of the FETI solver are reported in Table 10 where AR and $N_{itr}$ denote respectively the average subdomain aspect ratio and the number of FETI iterations for convergence.



Fig.  27.  Finite  element  discretization 
of  an  elastic  bearing 

For this application, it is clear that the size of the interface problem controls neither the communication time nor the total CPU time of the domain decomposition solver. In particular, note that for the above problem there is no correlation between $N_I$ and the communication costs per FETI iteration. This is essentially because the communication costs of this application are dominated by those associated with global dot products and some other full matrix linear algebra on a coarse grid problem. On the other hand, the results reported in Table 10 clearly demonstrate the importance of the subdomain aspect ratio for this class of applications. The two-step mesh decomposition method with contraction improves the subdomain aspect ratio of the mesh partitions generated by GR and RSB by a factor equal to 1.7, which reduces the number of FETI iterations by a factor equal to 1.5, and the total solution time by a factor equal to 1.4.

Table 10

Optimal FETI solver
Bearing mesh: Ne = 9600 - Nd = 33075 - Ns = 64
Effect of the subdomain aspect ratio

Scheme   Optimizer   Contraction   N_I     AR      T_comm/N_itr (s)   N_itr   T_sol (s)
RSB      None        No            5426    0.50    0.37               45      36.09
GR       None        No            5032    0.52    0.37               43      35.17
GR       SA          Yes           4430    0.84    0.40               30      25.77

8.6.3.  Parallel  Frontal  Solvers 

The  problem  of  computing  the  steady-state  flow 
of  an  incompressible  Oldroyd  fluid  in  a  two- 
cam  mixing  apparatus  arises  in  polymer  pro¬ 
cessing  applications.  This  problem  is  governed 
by  a  set  of  mixed  elliptic/hyperbolic  nonlinear 
partial  differential  equations.  Here,  we  con¬ 
sider  such  a  problem  and  the  flow  domain  de¬ 
picted  in  Fig.  28.  Its  finite  element  discretiza¬ 
tion  contains  1217  elements  only,  but  generates 
26082  equations.  At  each  Newton  iteration, 
these  equations  are  solved  with  the  frontal  di¬ 
rect  solver  described  in  [99]. 


Fig.  28.  Discretization  of  the  flow  domain 
in  a  two-cam  mixing  apparatus 

Among all parallel applications, the frontal direct solver is perhaps the most challenging one for mesh partitioning. Ideally, this algorithm requires a mesh partition where: (a) each subdomain frontwidth is smaller than or equal to the frontwidth of the global problem, (b) the computational load is perfectly balanced, and (c) the subdomain interfaces have a minimum and equal number of nodes. Criterion (a) should be emphasized, because trading computational efficiency for parallelism is not always a winning strategy. Enforcing criterion (b) is a seriously difficult task, because the computational load per subdomain cannot be derived a priori from the computational complexity of the global problem. Criterion (c) aims at minimizing the communication and storage requirements associated with the elimination of the interface unknowns.

Here, four 8-subdomain mesh partitions are generated for parallel computations on an 8-processor Convex Meta Series system, using GR, RSB, and the two-step mesh partitioning method with both GR and RSB as initial decomposers. For this application, the cost function to be optimized is set to $0.5 \times C_1 + 0.5 \times C_2$. However, note that in this case [...] cannot be determined a priori. Let $FR^k$ and $FR^{max,k}$ denote respectively the variable subdomain frontwidth and its maximum value. During the optimization (or retrofitting) process, [...] is computed so that [...] $\times FR^{max,k}$ is the same in all subdomains.

The characteristics of all four mesh partitions and the corresponding performances of the parallel frontal solver are reported in Table 11 where $EFRLBF = \mathrm{average}_k(N_e^k \times FR^k) / \max_k(N_e^k \times FR^k)$ is the estimated computational load balance factor, $T_{internal}^k$ is the parallel CPU time associated with the elimination of the subdomain internal unknowns, and $T_{sol}$ is the total parallel solution time. An internal renumbering scheme [32] is used in every subdomain for minimizing fill-in.


Table 11

Parallel frontal solver
Polymer flow mesh: Ne = 1217 - Nd = 26082 - Ns = 8
Effects of the subdomain frontwidth and load balancing

Scheme   Opt.   N_I   FR^{average,k}   EFRLBF   T_internal^{average,k}/T_internal^{max,k}   T_sol (s)
RSB      None   97    282.87           0.53     0.60                                        135.96
RSB      SA     88    269.38           0.83     0.85                                        80.01
GR       None   139   393.75           0.47     -                                           230.53
GR       SA     85    252.50           0.67     0.66                                        102.48

(For the GR/None row, only one of the two load balance entries survives in the source, and its column is uncertain.)

The  ability  of  EFRLBF  to  predict  the  com¬ 
putational  load  balance  of  the  parallel  frontal 
solver  is  well  illustrated  in  Table  11.  Also,  the 
suitability  of  the  selected  cost  function  and  the 
effectiveness  of  the  optimization  algorithm  are 
well  demonstrated.  For  example,  the  run-time 
load  balance  factor  for  the  RSB  mesh  partition 
is  equal  to  0.53,  while  that  of  the  optimized 
RSB  partition  is  equal  to  0.83.  The  net  result 
of  the  optimization  process  is  a  speedup  factor 
in  the  solution  time  equal  to  1.69.  For  the  GR 
partition,  the  net  result  of  the  retrofitting  step 
is  a  speedup  factor  equal  to  2.25.  Note  also 
that  for  the  above  problem,  the  mesh  partition 
that  leads  to  the  fastest  parallel  solution  of  the 
linearized  equations  is  neither  the  one  with  the 
minimum  number  of  interface  nodes,  nor  that 
with  the  minimum  subdomain  frontwidth,  but 
the  mesh  partition  with  the  best  predicted  load 
balance  factor  —  and  it  also  turns  out  to  be 
the  mesh  partition  with  the  best  run-time  load 
balance  factor. 

10.  APPLICATIONS  AND 
PERFORMANCE  RESULTS 

Here,  we  demonstrate  the  aeroelastic  compu¬ 
tational  methodology  described  in  the  previous 
sections  with  the  numerical  investigation  of  the 


instability  of  flat  panels  with  infinite  aspect  ra¬ 
tio  in  supersonic  airstreams,  and  the  solution 
of  three-dimensional  wing  response  problems  in 
the  transonic  regime.  All  flow  computations  are 
performed  using  the  Euler  equations  and  the 
explicit  solver. 

10.1.  Two-Dimensional  Aeroelastic 
Supersonic  Computations 

10.1.1.  Problem  Definition 

The flat panel with infinite aspect ratio considered here (Fig. 29) is assumed to have a length $L = 0.5$ m, a uniform thickness $h = 1.35 \times 10^{-3}$ m, a Young modulus $E = 7.728 \times 10^{10}$ N/m$^2$, a Poisson ratio $\nu = 0.33$, a density $\rho = 2710$ kg/m$^3$, and to be clamped at both ends. Its rectangular cross section is discretized into 1111 x 3 plane strain 4-node elements. This fine discretization (which generates 3333 elements with perfect aspect ratios and 4448 nodes) is not needed for accuracy; we have designed this structural mesh only because we are also interested in assessing some computational and I/O performance issues.
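
The element and node counts quoted for the panel mesh, and the claim of perfect element aspect ratios, can be checked with a few lines of arithmetic. The check assumes a thickness of 1.35 x 10^-3 m (the exponent is hard to read in the source; this value is the one that yields unit aspect ratios) and a regular grid of 1111 x 3 four-node elements.

```python
length = 0.5         # panel length L (m)
thickness = 1.35e-3  # panel thickness h (m), assumed exponent
nx, ny = 1111, 3     # elements along the length and through the thickness

n_elements = nx * ny
n_nodes = (nx + 1) * (ny + 1)
# Ratio of element length to element height; 1.0 means square elements.
aspect_ratio = (length / nx) / (thickness / ny)

print(f"{n_elements} elements, {n_nodes} nodes, aspect ratio {aspect_ratio:.4f}")
```

Both element edges come out at about 0.45 mm, which is why such a large number of elements is needed along the length once three elements are placed through the thickness.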

The two-dimensional flow domain above the panel is discretized into 32568 triangles and 16512 vertices. A slip condition is imposed at the fluid/structure boundary. Because the fluid and structural meshes are not compatible at their interface (Fig. 30), the Matcher software [35] is used to generate in a single preprocessing step the data structures required for transferring the pressure load to the structure, and the structural deformations at the upper surface of the panel to the fluid.

We consider several supersonic flows at different Mach numbers and discuss the performances of the partitioned analysis procedures ALG0, ALG1, ALG2, and ALG3. Whenever
subcycling  is  used,  the  interpolation  scheme 
is  used  to  prescribe  the  motion  of  the  fluid  grid 
points  on  F 

10.1.2.  Computational  Platform 

All  computations  are  performed  on  an  iPSC-860 
parallel  processor  using  double  precision  arith¬ 
metic.  The  fluid  and  structure  solvers  are  im¬ 
plemented  as  separate  programs  that  commu¬ 
nicate  via  the  intercube  communication  proce¬ 
dures  described  in  [104]. 

10.1.3.  Assessment  of  the  Partitioned  Procedures 

In order to illustrate the relative merits of the partitioned procedures ALG0, ALG1, ALG2 and ALG3, we consider first two different series of transient aeroelastic simulations at Mach number $M_\infty = 1.90$ that highlight

•  the relative accuracy of these algorithms for a fixed subcycling factor $n_{S/F}$.

•  the relative speed of these algorithms for a fixed level of accuracy, on both sequential and parallel computational platforms.

In all cases, 64 processors are allocated to the fluid system, and 2 processors are assigned to the structural solver. Initially, a steady-state flow is computed above the panel at $M_\infty = 1.90$ (Fig. 31), a speed at which the panel described above is not supposed to flutter. Then, the aeroelastic response of the coupled system is triggered by a displacement perturbation of the panel along its first mode (Fig. 32).

Fig. 30. Mesh incompatibility

Fig. 31. Pressure isovalues for the steady-state flow solution ($M_\infty = 1.90$)
Fig. 32. Initial perturbation of the panel displacement field

First, the subcycling factor is fixed to n_{S/F} = 30, and the lift coefficient is
computed using the time-step Δt = 3.9 × 10^{-6} corresponding to the stability limit of
the explicit flow solver in the absence of coupling with the structure. The obtained
results are depicted in Fig. 33 for the first 4102 time-steps. For n_{S/F} = 30, ALG1 and
ALG3 exhibit essentially the same accuracy. In the long run, their amplitude and phase
errors are less important than those of ALG2. Clearly, this highlights the superiority
of ALG3 which, despite its inter-field parallelism and unlike ALG2, is capable of
delivering the same accuracy as the sequential algorithm ALG1.

Fig. 33. Lift coefficient history for n_{S/F} = 30

Fig. 34. Lift coefficient history for a fixed level of accuracy (iso-precision
subcycling; horizontal axis: time in s, from 0 to 0.016)


Next, the relative speed of the focus partitioned solution procedures is assessed by
comparing their CPU performance for a certain level of accuracy dictated by ALG0. It
turns out that in order to meet the accuracy requirements of ALG0, ALG1 and ALG3 can use
a subcycling factor as large as n_{S/F} = 10, but ALG2 can subcycle only up to
n_{S/F} = 5 (Fig. 34).

The performance results measured on the iPSC-860 are reported in Table 12, where ICC
denotes the intercube communication time. Note that ICC is measured in the fluid kernel
and includes idle time when the flow and structural communications do not overlap.
Table 12. Performance results on the iPSC-860
Fluid: 64 processors   Structure: 2 processors
Elapsed time for 4102 fluid time-steps

Algorithm              Fluid        Structure    Fluid-Wait+ICC   Total CPU
ALG0                   2617.23 s    1267.93 s    1283.10 s        3900.33 s
ALG1 (n_{S/F} = 10)    2625.11 s     126.67 s     127.90 s        2753.01 s
ALG2 (n_{S/F} = 5)     2643.57 s     253.34 s       1.67 s        2645.24 s
ALG3 (n_{S/F} = 10)    2603.56 s     253.23 s       1.37 s        2604.93 s


From the results reported in Table 12, the following observations can be made:

• the fluid computations dominate the simulation time. This is partly because the
structural model is simple in this case, and a linear elastic behavior is assumed for
the panel.

• considering that the iPSC-860 has 128 processors and that only clusters of 2^n
processors can be defined on this machine, allocating 64 processors to the fluid and
2 processors to the structure achieves the minimum possible inter-field load imbalance
for this coupled problem.

• the effect of subcycling on intercube communication costs is clearly demonstrated.
Because the flow solution time is dominating, the effect of subcycling on the total CPU
time is less important for ALG2 and ALG3, which feature inter-field parallelism in
addition to intra-field multiprocessing, than for ALG1, which features intra-field
parallelism only (note that ALG1 with n_{S/F} = 1 is identical to ALG0).

• ALG2 and ALG3 allow a perfect overlap of inter-field communications, which reduces
intercube communication and idle time to less than 0.3% of the amount corresponding to
ALG0.

• the superiority of ALG3 over ALG2 is not clearly demonstrated for this problem because
of the simplicity of the structural model and the subsequent load imbalance between the
fluid and structure computations.
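The observations above can be checked directly against the entries of Table 12. The following is a minimal sketch, not part of the original study: the dictionary keys and the helper name `summarize` are illustrative, and the relation "Total CPU = Fluid + Fluid-Wait+ICC" is only what the tabulated numbers suggest, on the assumption that the structure solver runs concurrently on its own processors.

```python
# Timings in seconds, transcribed from Table 12 (iPSC-860, 4102 fluid time-steps).
TABLE_12 = {
    "ALG0": {"fluid": 2617.23, "structure": 1267.93, "wait_icc": 1283.10, "total": 3900.33},
    "ALG1": {"fluid": 2625.11, "structure": 126.67, "wait_icc": 127.90, "total": 2753.01},
    "ALG2": {"fluid": 2643.57, "structure": 253.34, "wait_icc": 1.67, "total": 2645.24},
    "ALG3": {"fluid": 2603.56, "structure": 253.23, "wait_icc": 1.37, "total": 2604.93},
}


def summarize(table, reference="ALG0"):
    """Derive simple indicators for each algorithm relative to a reference one."""
    ref = table[reference]
    rows = {}
    for name, t in table.items():
        rows[name] = {
            # Assumed relation: fluid time plus fluid-side wait/communication time.
            "total_check": round(t["fluid"] + t["wait_icc"], 2),
            # Wait + intercube communication as a percentage of the reference value.
            "wait_icc_vs_ref_percent": 100.0 * t["wait_icc"] / ref["wait_icc"],
            # Ratio of total CPU times (greater than 1 means faster than the reference).
            "speedup_vs_ref": ref["total"] / t["total"],
        }
    return rows


summary = summarize(TABLE_12)
for name, row in summary.items():
    print(name, row)
```

With these numbers, the wait and communication time of ALG2 and ALG3 stays below the 0.3% mark quoted above, while the gain in total CPU time remains moderate because the fluid solver dominates.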

10.1.4. Panel Flutter

The classical analytical solution of the instability problem of flat panels with
infinite aspect ratio in supersonic airstreams assumes a shallow shell theory for the
structure and a linearized formulation for the flow problem (piston theory). Within this
analytical approach, the dynamics of the focus coupled fluid/structure system are
governed by a fourth-order partial differential equation [2, page 419], and the flutter
condition is obtained by analyzing the roots of the corresponding characteristic
equation. For the panel described at the beginning of this section, the classical linear
theory predicts flutter at a critical Mach number of 1.98. The objective of this section
is to validate the aeroelastic simulation capability presented in this paper by
reproducing the theoretical critical Mach number for the given panel. Note that in order
to compare the analytical and finite element approaches, the coefficients of the shallow
shell equation described in [2, page 419] must be computed to represent the same
equation as that corresponding to the finite element model used in this paper.

Four different runs at M∞ = 2.0, M∞ = 2.05, M∞ = 2.095, and M∞ = 2.13 are performed
using ALG3. For each run, a steady-state flow is first computed. Then, a displacement
perturbation of the panel along its first mode (Fig. 32) is imposed, and the aeroelastic
response of the coupled system is computed. The predicted time histories of the lift
coefficient are depicted in Fig. 35 for all four cases.


Fig.  35.  Flutter  analysis 


From the results reported in Fig. 35, it follows that the flutter Mach number predicted
by our formulation satisfies 2.05 < M∞ < 2.095. Hence, this flutter speed is about 4.5%
higher than that predicted by the piston theory. This is rather good agreement, given
that the piston theory and the computational approach presented herein do not share
exactly the same approximations.
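The size of this discrepancy can be bounded from the bracket itself. The short sketch below is an illustration added here, with hypothetical variable names; it only evaluates the relative difference between the two ends of the computed bracket and the value of 1.98 given by the linear theory.

```python
M_THEORY = 1.98                # critical Mach number predicted by the linear (piston) theory
M_LOW, M_HIGH = 2.05, 2.095    # bracket obtained from the ALG3 runs


def percent_above(value, reference):
    """Relative difference of value with respect to reference, in percent."""
    return 100.0 * (value - reference) / reference


low = percent_above(M_LOW, M_THEORY)
high = percent_above(M_HIGH, M_THEORY)
mid = percent_above(0.5 * (M_LOW + M_HIGH), M_THEORY)
print(f"{low:.1f}% to {high:.1f}% (bracket midpoint: {mid:.1f}%)")
```

The figure quoted in the text lies inside the interval delimited by the two ends of the bracket.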

Finally, we report in Fig. 36 the history of the accumulated external energy at
M∞ = 2.095 for both the fluid and structural systems. At this speed, the panel is
clearly shown to extract energy from the fluid, and therefore to flutter. Note that
Fig. 36 also highlights the quality of the matching performed by Matcher: the amount of
external energy extracted by the structure is shown to be equal to that lost by the
fluid, as it should be.

Fig. 36. Accumulated external energy (M∞ = 2.095)


10.2.  Three-Dimensional  Aeroelastic 
Transonic  Computations 

10.2.1.  Problem  Definition 

Next, we consider transient aeroelastic response problems associated with a simple
structural model of the ONERA M6 wing.

The wing is represented by an equivalent plate model discretized in 1071 triangular
plate elements, 582 nodes, and 6426 degrees of freedom (Fig. 37). Four meshes M1-M4 are
designed for the discretization of the three-dimensional flow domain around the wing.
The characteristics of these meshes are given in Table 13, where N_ver, N_tet, N_fac,
and N_var denote respectively the number of vertices, tetrahedra, facets (edges), and
fluid variables. A partial view of the discretization of the flow domain is shown in
Fig. 38.


Table 13
Characteristics of meshes M1-M4

Mesh    N_ver     N_tet     N_fac     N_var
M1      15460     80424     99891     77300
M2      31513    161830    201479    157565
M3      63917    337604    415266    319585
M4     115351    643392    774774    576755
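A quick consistency check on Table 13 is sketched below. It is an illustration added here, with hypothetical names: it computes the ratio N_var/N_ver for each mesh. A ratio of five would match the five conservative variables per vertex of a three-dimensional compressible flow model (density, three momentum components, energy); that interpretation is an assumption and is not stated in the table.

```python
# (N_ver, N_tet, N_fac, N_var) transcribed from Table 13.
MESHES = {
    "M1": (15460, 80424, 99891, 77300),
    "M2": (31513, 161830, 201479, 157565),
    "M3": (63917, 337604, 415266, 319585),
    "M4": (115351, 643392, 774774, 576755),
}

# Number of fluid unknowns carried by each vertex of each mesh.
unknowns_per_vertex = {
    name: n_var / n_ver for name, (n_ver, _n_tet, _n_fac, n_var) in MESHES.items()
}
print(unknowns_per_vertex)
```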

Fig.  37.  Finite  element  plate  model  of  the  wing 


Fig. 38. Partial view of the fluid mesh M1 on the skin of the ONERA M6 wing

The sizes of the fluid meshes M1-M4 have been tailored for parallel computations on
respectively 16 (M1), 32 (M2), 64 (M3), and 128 processors (M4) of a Paragon XP/S and a
Cray T3D system. In particular, the sizes of these meshes are such that the processors
of a Paragon XP/S machine with 32 Mbytes per node would not swap when solving the
corresponding flow problems.

Here again, the fluid and structural meshes are not compatible at their interface.
Matcher [35] is used to generate, in a single preprocessing step, the data structures
required for transferring the pressure load to the structure, and the structural
deformations to the fluid.

10.2.2.  Computational  Platforms 

All computations are performed on an iPSC-860, a Paragon XP/S, a Cray T3D, and/or an
IBM SP2 computer using double precision arithmetic. Message passing is carried out via
NX on the Paragon XP/S multiprocessor, PVM T3D on the Cray T3D system, and MPI on the
IBM SP2 parallel processor. The fluid and structure solvers are implemented as separate
programs that communicate via the intercube communication procedures described in [104].

10.2.3.  Parallel  Performance  of  the  Flow  Solver 

The performance of the parallel flow solver is assessed with the computation of the
steady state of a flow around the given wing at a Mach number M∞ = 0.84 and an angle of
attack of 3.06 degrees (Fig. 39). The CFL number is set to 0.9. The four meshes M1-M4
are decomposed in respectively 16, 32, 64, and 128 overlapping subdomains using
TOP/DOMDEC [82]. The motivations for employing overlapping subdomains and the impact of
this computational strategy on parallel performance are discussed in [49]. The CPU
timings in seconds are reported in Tables 14-16 for the first 100 iterations on a
Paragon XP/S machine (128 processors), a Cray T3D system (128 processors), and an IBM
SP2 computer (32 processors), respectively. In these tables, N_p, N_var, T_comm^loc,
T_comm^glo, T_comp, T_tot, and Mflops denote respectively the number of processors, the
number of variables (unknowns) to be solved, the time elapsed in short range
interprocessor communication between neighboring subdomains, the time elapsed in long
range global interprocessor communication, the computational time, the total simulation
time, and the computational speed in millions of floating point operations per second.
Typically, short range communication is needed for assembling various subdomain results
such as fluxes at the subdomain interfaces, and long range interprocessor communication
is required for reduction operations such as those occurring in the evaluation of the
stability time-steps and the norms of the nonlinear residuals. Because message-passing
is also used for synchronization, the reported communication timings include any idle
time due to load imbalance. We also note that we use the same fluid code for steady
state and aeroelastic computations. Hence, even though we are benchmarking in Tables
14-16 a steady state computation with a local time stepping strategy, we are still
timing the kernel that evaluates the global time-step in order to reflect its impact on
the unsteady computations that we perform in aeroelastic simulations such as those that
are discussed next. The Mflop rates reported in Tables 14-16 are computed in a strict
manner: they exclude all the redundant computations associated with the overlapping
subdomain regions.
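The two reduction operations mentioned above can be illustrated with a serial sketch in which each list entry stands for the contribution of one subdomain. This is not the solver's code: the function names and the numerical values are hypothetical, and the sketch assumes that every unknown is owned by exactly one subdomain, so that overlapping regions are not counted twice. In a message-passing implementation, each function would correspond to a global reduction (in MPI, an all-reduce with a minimum or a sum operator).

```python
import math


def global_time_step(local_time_steps):
    """Global stable time-step: minimum of the subdomain values (a 'min' reduction)."""
    return min(local_time_steps)


def global_residual_norm(local_sums_of_squares):
    """Global 2-norm: add the subdomain partial sums (a 'sum' reduction), then take the root."""
    return math.sqrt(sum(local_sums_of_squares))


# Hypothetical data for four subdomains, for illustration only.
local_dt = [4.1e-6, 3.9e-6, 4.4e-6, 4.0e-6]
local_residuals = [[3.0], [4.0], [12.0], [0.0]]

# Each subdomain first reduces its own entries locally, without any communication.
local_sums = [sum(r * r for r in res) for res in local_residuals]

dt_global = global_time_step(local_dt)
res_norm = global_residual_norm(local_sums)
print(dt_global, res_norm)
```

Only one scalar per subdomain takes part in each reduction, so the cost of this kind of communication is governed by latency and synchronization rather than by message size.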


Fig.  39.  Steady-state  Mach  lines 
(ONERA  M6  wing  —  mesh  M4) 


Table 14
Performance of the parallel flow solver on the Paragon XP/S system (16-128 processors)
100 iterations, CFL = 0.9

Mesh    N_p    T_comm^loc    T_comm^glo    T_comp      T_tot       Mflops
M1       16    2.0 s          40.0 s        96.0 s     138.0 s       84
M2       32    4.5 s          57.0 s        98.5 s     160.0 s      145
M3       64    7.0 s          90.0 s       103.0 s     200.0 s      240
M4      128    6.0 s         105.0 s       114.0 s     225.0 s      401
Table 15
Performance of the parallel flow solver on the Cray T3D system (16-128 processors)
100 iterations, CFL = 0.9

Mesh    N_p    T_comm^loc    T_comm^glo    T_comp      T_tot       Mflops
M1       16    1.6 s         2.1 s          87.3 s      91.0 s      127
M2       32    2.5 s         4.1 s         101.4 s     108.0 s      215
M3       64    3.5 s         7.2 s         100.3 s     111.0 s      433
M4      128    3.0 s         7.2 s          85.3 s      95.5 s      945

Table 16
Performance of the parallel flow solver on the IBM SP2 system (4-32 processors)
100 iterations, CFL = 0.9

Mesh    N_p    T_comm^loc    T_comm^glo    T_comp      T_tot       Mflops
M1        4    0.8 s         0.4 s         70.8 s      72.0 s       160
M2        8    1.1 s         0.6 s         73.3 s      75.0 s       308
M3       16    1.4 s         0.7 s         78.9 s      81.0 s       594
M4       32    2.0 s         1.0 s         79.0 s      82.0 s      1102

The reader can easily verify that the number of processors assigned to each mesh is such
that N_var/N_p is almost constant. This means that larger numbers of processors are
attributed to larger meshes in order to keep each local problem within a processor at an
almost constant size. For such a benchmarking strategy, parallel scalability of the flow
solver on a target parallel processor implies that the total solution CPU time should be
constant for all meshes and their corresponding number of processors. This is clearly
not the case for the Paragon XP/S system. On this machine, short range communication is
shown to be inexpensive, but long range communication costs are reported to be
important. This is certainly due to the latency of the Paragon XP/S parallel processor,
which is an order of magnitude slower than that of the Cray T3D system. Another possible
source of global communication time increase is the load imbalance between the
processors, since message passing is also used for synchronization. However, this does
not seem to be significant on the T3D and SP2 parallel processors.
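The verification suggested above can be sketched as follows for the Paragon XP/S and Cray T3D runs. This is an illustration added here: the "scaled efficiency" T_tot(M1)/T_tot(mesh) is a metric chosen for this sketch and is not defined in the paper, and the variable names are hypothetical.

```python
N_VAR = {"M1": 77300, "M2": 157565, "M3": 319585, "M4": 576755}  # from Table 13
N_P = {"M1": 16, "M2": 32, "M3": 64, "M4": 128}                  # from Tables 14 and 15
T_TOT = {                                                         # seconds, 100 iterations
    "Paragon XP/S": {"M1": 138.0, "M2": 160.0, "M3": 200.0, "M4": 225.0},
    "Cray T3D": {"M1": 91.0, "M2": 108.0, "M3": 111.0, "M4": 95.5},
}

# Size of the local problem handled by each processor.
load_per_processor = {mesh: N_VAR[mesh] / N_P[mesh] for mesh in N_VAR}


def scaled_efficiency(times, baseline="M1"):
    """T_tot(baseline) / T_tot(mesh): 1.0 means the run time stays flat as the problem grows."""
    return {mesh: times[baseline] / t for mesh, t in times.items()}


efficiency = {machine: scaled_efficiency(times) for machine, times in T_TOT.items()}
print(load_per_processor)
print(efficiency)
```

A flat T_tot corresponds to an efficiency close to one for every mesh; the values obtained for the Paragon XP/S fall well below that, in line with the discussion above.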

On the other hand, parallel scalability is well demonstrated for the Cray T3D and IBM
SP2 systems. The results reported in Tables 15 and 16 show that all computations using
meshes M1-M4 and the corresponding number of processors consume almost the same total
amount of CPU time. For 128 processors, the Cray T3D system is shown to be more than
twice as fast as the Paragon XP/S machine. The difference appears to be mainly in long
range communication, as the computational times reported for both machines are
comparable. However, most impressive is the fact that an IBM SP2 with only 32 processors
is shown to be nearly three times as fast as a 128-processor Paragon XP/S, and faster
than a Cray T3D with 128 processors.
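These comparisons follow from the mesh M4 rows of Tables 14-16. The sketch below is an illustration with hypothetical names; it computes the ratios of total times and the share of the Paragon/T3D gap that is accounted for by long range communication.

```python
# Total time in seconds for 100 iterations on mesh M4, from Tables 14-16.
T_TOT_M4 = {"paragon_128": 225.0, "t3d_128": 95.5, "sp2_32": 82.0}
# Long range (global) communication time on mesh M4, from Tables 14 and 15.
T_GLO_M4 = {"paragon_128": 105.0, "t3d_128": 7.2}

t3d_vs_paragon = T_TOT_M4["paragon_128"] / T_TOT_M4["t3d_128"]
sp2_vs_paragon = T_TOT_M4["paragon_128"] / T_TOT_M4["sp2_32"]
sp2_vs_t3d = T_TOT_M4["t3d_128"] / T_TOT_M4["sp2_32"]

# Fraction of the Paragon/T3D difference in total time due to global communication.
glo_share_of_gap = (T_GLO_M4["paragon_128"] - T_GLO_M4["t3d_128"]) / (
    T_TOT_M4["paragon_128"] - T_TOT_M4["t3d_128"]
)

print(f"T3D vs Paragon: {t3d_vs_paragon:.2f}x")
print(f"SP2 vs Paragon: {sp2_vs_paragon:.2f}x")
print(f"SP2 vs T3D:     {sp2_vs_t3d:.2f}x")
print(f"share of gap from global communication: {glo_share_of_gap:.0%}")
```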


Fig.  40.  Initial  perturbation  of  the  structure 


10.2.4. Performance of the Partitioned
Analysis Procedures


As in the two-dimensional application, we first consider two different series of
transient aeroelastic simulations at Mach number M∞ = 0.84 that highlight

• the relative accuracy of these coupled solution algorithms for a fixed subcycling
factor n_{S/F}.

• the relative speed of these coupled solution algorithms for a fixed level of accuracy.

In all cases, mesh M2 is used for the flow computations, 32 processors of an iPSC-860
system are allocated to the fluid solver, and 4 processors of the same machine are
assigned to the structural code. Initially, a steady-state flow is computed around the
wing at M∞ = 0.84, a Mach number at which the wing described above is not supposed to
flutter. Then, the aeroelastic response of the coupled system is triggered by a
displacement perturbation of the wing along its first mode (Fig. 40).


Fig. 41. Lift history for the first half cycle
(n_{S/F} = 10)


First, the subcycling factor is fixed to n_{S/F} = 10, then to n_{S/F} = 30, and the
lift is computed using a time-step corresponding to the stability limit of the explicit
flow solver in the absence of coupling with the structure. The obtained results are
depicted in Fig. 41 and Fig. 42 for the first half cycle.


Fig. 42. Lift history for the first half cycle
(n_{S/F} = 30)
The superiority of the parallel fluid-subcycled ALG3 solution procedure is clearly
demonstrated in Fig. 41 and Fig. 42. For n_{S/F} = 10, ALG3 is shown to be the closest
to ALG0, which is supposed to be the most accurate since it is sequential and
non-subcycled. ALG1 and ALG2 have comparable accuracies. However, both of these
algorithms exhibit a significantly more important phase error than ALG3, especially for
n_{S/F} = 30.
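The role of the subcycling factor can be pictured with a schematic loop in which one coupled (structural) step spans n_{S/F} fluid time-steps, and the interface position seen by the fluid is interpolated in between. The sketch below is generic and is not any of ALG0-ALG3: the linear interpolation, the function names, and the callback `u_of_step` are assumptions made for illustration, and the actual ordering of the fluid and structure updates in each procedure is defined elsewhere in the paper.

```python
def interpolate_interface(u_start, u_end, k, n_sf):
    """Linearly interpolated interface displacement at fluid subcycle k (1..n_sf)."""
    return u_start + (k / n_sf) * (u_end - u_start)


def subcycled_run(n_fluid_steps, n_sf, u_of_step):
    """Count fluid/structure exchanges and record the interface positions seen by the fluid.

    u_of_step(i) is a hypothetical callback returning the structural interface
    displacement at the end of coupled step i.
    """
    exchanges = 0
    positions = []
    fluid_done = 0
    coupled_step = 0
    while fluid_done < n_fluid_steps:
        u_start = u_of_step(coupled_step)
        u_end = u_of_step(coupled_step + 1)
        exchanges += 1  # one inter-field exchange per coupled step
        for k in range(1, n_sf + 1):
            if fluid_done == n_fluid_steps:
                break
            positions.append(interpolate_interface(u_start, u_end, k, n_sf))
            fluid_done += 1
        coupled_step += 1
    return exchanges, positions


# 50 fluid time-steps with a subcycling factor of 10 and a made-up interface motion.
exchanges, positions = subcycled_run(50, 10, lambda i: 0.1 * i)
print(exchanges, len(positions))
```

With n_{S/F} = 1 the loop exchanges data at every fluid time-step; larger factors divide the number of exchanges accordingly, which is the mechanism behind the reduction in inter-field communication discussed in this section.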

Next, the relative speed of the partitioned solution procedures is assessed by comparing
their CPU time for a certain level of accuracy dictated by ALG0. For this problem, it
turned out that in order to meet the accuracy requirements of ALG0, the solution
algorithms ALG1 and ALG2 can subcycle only up to n_{S/F} = 5, while ALG3 can easily use
a subcycling factor as large as n_{S/F} = 10. The performance results measured on an
iPSC-860 system are reported in Table 17 for the first 50 coupled time-steps. In this
table, ICWF and ICWS denote the intercube communication timings measured respectively in
the fluid and structural kernels; these timings include idle and synchronization (wait)
time when the fluid and structural communications do not completely overlap. For
programming reasons, ICWS is monitored together with the evaluation of the pressure
load.


Table 17. Performance results on the iPSC-860
Fluid: 32 processors   Structure: 4 processors
Elapsed time for 50 fluid time-steps

Alg.    Fluid      Fluid      Struc.     ICWS       ICWF       Total
        Solver     Motion     Solver                           CPU
ALG0    177.4 s    71.2 s     33.4 s     219.0 s    384.1 s    632.7 s
ALG1    180.0 s    71.2 s     16.9 s     216.9 s     89.3 s    340.5 s
ALG2    184.8 s    71.2 s     16.6 s     114.0 s      0.4 s    256.4 s
ALG3    176.1 s    71.2 s     10.4 s     112.3 s      0.4 s    247.7 s

From the results reported in Table 17, the following observations can be made:

• the fluid computations dominate the simulation time. This is partly because the
structural model is again simple in this case, and a linear elastic behavior is assumed.
However, by allocating 32 processors to the fluid kernel and 4 processors to the
structure code, a reasonable load balance is shown to be achieved for ALG0.

• during the first 50 fluid time-steps, the CPU time corresponding to the structural
solver does not decrease linearly with the subcycling factor n_{S/F} because of the
initial costs of the FETI reorthogonalization procedure designed for the efficient
iterative solution of implicit systems with repeated right hand sides (see Section 5).

• the effect of subcycling on intercube communication costs is clearly demonstrated.
Because the flow solution time is dominating, the impact of this effect on the total CPU
time is less important for ALG2 and ALG3, which feature inter-field parallelism in
addition to intra-field multiprocessing, than for ALG1, which features intra-field
parallelism only (note that ALG1 with n_{S/F} = 1 is identical to ALG0).

• ALG2 and ALG3 allow a certain amount of overlap between inter-field communications,
which reduces intercube communication and idle time on the fluid side to about 0.1% of
the amount corresponding to ALG0.

Most importantly, the performance results reported in Table 17 demonstrate that
subcycling and inter-field parallelism are desirable for aeroelastic simulations even
when the flow computations dominate the structural ones, because these features can
significantly reduce the total simulation time by minimizing the amount of inter-field
communications and overlapping them. For the simple problem described herein, the
parallel fluid-subcycled ALG2 and ALG3 algorithms are more than twice as fast as the
conventional staggered procedure ALG0.
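The speedups and the reduction in fluid-side communication time can be recomputed from Table 17. The sketch below is an illustration with hypothetical names; the relation "Total CPU = Fluid Solver + Fluid Motion + ICWF" is only what the tabulated numbers suggest and is treated here as an assumption.

```python
# Timings in seconds, transcribed from Table 17 (iPSC-860, 50 fluid time-steps).
TABLE_17 = {
    "ALG0": {"fluid_solver": 177.4, "fluid_motion": 71.2, "struc_solver": 33.4,
             "icws": 219.0, "icwf": 384.1, "total": 632.7},
    "ALG1": {"fluid_solver": 180.0, "fluid_motion": 71.2, "struc_solver": 16.9,
             "icws": 216.9, "icwf": 89.3, "total": 340.5},
    "ALG2": {"fluid_solver": 184.8, "fluid_motion": 71.2, "struc_solver": 16.6,
             "icws": 114.0, "icwf": 0.4, "total": 256.4},
    "ALG3": {"fluid_solver": 176.1, "fluid_motion": 71.2, "struc_solver": 10.4,
             "icws": 112.3, "icwf": 0.4, "total": 247.7},
}

reference = TABLE_17["ALG0"]

# Ratio of total CPU times with respect to the conventional staggered procedure.
speedup = {name: reference["total"] / row["total"] for name, row in TABLE_17.items()}

# Fluid-side intercube communication and wait time, in percent of the ALG0 value.
icwf_percent = {name: 100.0 * row["icwf"] / reference["icwf"] for name, row in TABLE_17.items()}

# Assumed decomposition of the total CPU time as seen from the fluid side.
fluid_side_sum = {
    name: row["fluid_solver"] + row["fluid_motion"] + row["icwf"]
    for name, row in TABLE_17.items()
}

for name in TABLE_17:
    print(name, round(speedup[name], 2), round(icwf_percent[name], 3), round(fluid_side_sum[name], 1))
```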

11. CONCLUSIONS

In this paper, we have highlighted some key elements of the solution of large-scale
three-dimensional nonlinear aeroelastic problems on high performance computational
platforms. We have described a three-field arbitrary Lagrangian-Eulerian (ALE) finite
volume/element formulation for the coupled fluid/structure problem, presented geometric
conservation laws for three-dimensional flow problems with moving boundaries and
unstructured and deformable meshes, and discussed the solution of the corresponding
coupled semi-discrete equations with partitioned analysis procedures. In particular, we
have presented a family of mixed explicit/implicit staggered solution algorithms, and
discussed them with particular reference to accuracy, stability, subcycling, and
parallel processing. We have described a general framework for the solution of coupled
aeroelastic problems on heterogeneous and/or parallel computational platforms, and
illustrated it with two- and three-dimensional applications on iPSC-860, Paragon XP/S,
and Cray T3D massively parallel systems. We have shown that even when the flow
computations dominate the total CPU time of a coupled aeroelastic simulation, subcycling
and inter-field parallelism are desirable as they can significantly speed up the total
solution time.

ACKNOWLEDGEMENTS 

The  author  acknowledges  partial  support  by 
the  National  Science  Foundation  under  Grant 
ASC-9217394,  partial  support  by  RNR  NAS 
at  NASA  Ames  Research  Center  under  Grant 
NAG  2-827,  and  partial  support  by  CMB  at  the 
NASA  Langley  Research  Center  under  Grant 
NAG-1536427.  He  wishes  to  thank  Po-Shu 
Chen,  Stephane  Lanteri,  Michel  Lesoinne,  Serge 
Piperno,  and  Paul  Stern  for  their  help  in  this 
research  effort. 

REFERENCES 

[1] H. Tijdeman and R. Seebass, Transonic flow past oscillating airfoils, Ann. Rev.
Fluid Mech. 12 (1980) 181-222.

[2] R. L. Bisplinghoff and H. Ashley, Principles of aeroelasticity, Dover Publications,
Inc., 1962.

[3] Y. C. Fung, An introduction to the theory of aeroelasticity, Dover Publications,
Inc., 1969.

[4] R. Dat and J. L. Meurzec, Sur les calculs de flottement par la méthode dite du
"balayage" en fréquence réduite, La Recherche Aérospatiale 133, Nov.-Dec. 1969.

[5] C. Bon, M. Geradin, and J. P. Grisval, Private communication, 1993.

[6] E. Albano and W. P. Rodden, A doublet-lattice method for calculating lift
distribution on oscillating surfaces in subsonic flow, AIAA J. 7 (1969) 279-285.




[7] W. P. Jones and K. Appa, Unsteady supersonic aerodynamic theory for interfering
surfaces by the method of potential gradient, NASA CR 2898, October 1977.

[8] P. C. Chen and D. D. Liu, A harmonic gradient method for unsteady supersonic flow
calculations, AIAA 83-0887 CP, May 1983.

[9] E. H. Dowell and M. Ilgamov, Studies in Nonlinear Aeroelasticity, Springer-Verlag,
1988.

[10] J. Donea, An arbitrary Lagrangian-Eulerian finite element method for transient
fluid-structure interactions, Comput. Meths. Appl. Mech. Engrg. 33 (1982) 689-723.

[11] T. J. R. Hughes, W. K. Liu and T. K. Zimmermann, Lagrangian-Eulerian finite element
formulation for incompressible viscous flows, U.S.-Japan Seminar on Interdisciplinary
Finite Element Analysis, Cornell Univ., Ithaca, NY, Aug. 7-11 (1978).

[12] T. Belytschko and J. M. Kennedy, Computer models for subassembly simulation, Nucl.
Eng. Design 49 (1978) 17-38.

[13] O. A. Kandil and H. A. Chuang, Unsteady vortex-dominated flows around maneuvering
wings over a wide range of Mach numbers, AIAA Paper No. 88-0317, AIAA 26th Aerospace
Sciences Meeting, Reno, Nevada, January 11-14, 1988.

[14] C. Farhat and T. Y. Lin, Transient aeroelastic computations using multiple moving
frames of reference, AIAA Paper No. 90-3053, AIAA 8th Applied Aerodynamics Conference,
Portland, Oregon, August 20-22, 1990.

[15] J. T. Batina, Unsteady Euler airfoil solutions using unstructured dynamic meshes,
AIAA Paper No. 89-0115, AIAA 27th Aerospace Sciences Meeting, Reno, Nevada,
January 9-12, 1989.

[16] G. P. Guruswamy, Time-accurate unsteady aerodynamic and aeroelastic calculations of
wings using Euler equations, AIAA Paper No. 88-2281, AIAA 29th Structures, Structural
Dynamics and Materials Conference, Williamsburg, Virginia, April 18-20, 1988.

[17] T. Tezduyar, M. Behr and J. Liou, A new strategy for finite element computations
involving moving boundaries and interfaces - The deforming spatial domain/space-time
procedure: I. The concept and the preliminary numerical tests, Comput. Meths. Appl.
Mech. Engrg. 94 (1992) 339-351.

[18] M. Lesoinne and C. Farhat, Stability analysis of dynamic meshes for transient
aeroelastic computations, AIAA Paper No. 93-3325, 11th AIAA Computational Fluid Dynamics
Conference, Orlando, Florida, July 6-9, 1993.

[19] M. Lesoinne, Mathematical analysis of three-field numerical methods for aeroelastic
problems, Ph.D. Thesis, The University of Colorado at Boulder, December 1994.

[20] K. C. Park and C. A. Felippa, Partitioned analysis of coupled systems, in:
Computational Methods for Transient Analysis, T. Belytschko and T. J. R. Hughes, Eds.,
North-Holland Pub. Co. (1983) 157-219.

[21] T. Belytschko, P. Smolenski and W. K. Liu, Stability of multi-time step partitioned
integrators for first-order finite element systems, Comput. Meths. Appl. Mech. Engrg. 49
(1985) 281-297.

[22] C. Farhat, K. C. Park and Y. D. Pelerin, An unconditionally stable staggered
algorithm for transient finite element analysis of coupled thermoelastic problems,
Comput. Meths. Appl. Mech. Engrg. 85 (1991) 349-365.

[23] S. Piperno, C. Farhat and B. Larrouturou, Partitioned procedures for the transient
solution of coupled aeroelastic problems, Comput. Meths. Appl. Mech. Engrg., (in press).

[24] C. J. Borland and D. P. Rizzetta, Nonlinear transonic flutter analysis, AIAA
Journal (1982) 1606-1615.




[25]  V.  Shankar  and  H.  Ide,  Aeroelastic  compu¬ 
tations  of  flexible  configurations,  Comput.  &: 
Struc.  30  (1988)  15-28. 

[26]  R.  D.  Rausch,  J.  T.  Batina  and  T.  Y. 
Yang,  Euler  flutter  analysis  of  airfoils  using  un¬ 
structured  dynamic  meshes,  AIAA  Paper  No. 
89-13834,  30th  Structures,  Structural  Dynam¬ 
ics  and  Materials  Conference,  Mobile,  Alabama, 
April  3-5,  1989. 

[27]  M.  Blair,  M.  H.  Williams  and  T.  A.  Weis- 
shaar,  Time  domain  simulations  of  a  flexible 
wing  in  subsonic  compressible  flow,  AIAA  Pa¬ 
per  No.  90-1153,  AIAA  8th  Applied  Aerody¬ 
namics  Conference,  Portland,  Oregon,  August 
20-22,  1990. 

[28]  T.  W.  Strganac  and  D.  T.  Mook,  Numerical 
model  of  unsteady  subsonic  aeroelastic  behav¬ 
ior,  AIAA  Journal  28  (1990)  903-909. 

[29]  E.  Pramono  and  S.  K.  Weeratunga,  Aeroe¬ 
lastic  computations  for  wings  through  direct 
coupling  on  distributed-memory  MIMD  paral¬ 
lel  computers,  AIAA  Paper  No.  94-0095,  32nd 
Aerospace  Sciences  Meeting  h  Exhibit,  Reno, 
January  10-13,  1994. 

[30]  V.  Venkatakrishnan,  A  perspective  of  un¬ 
structured  grid  flow  solvers,  ICASE  Report  No. 
95-3,  NASA  Langley  Research  Center,  Febru¬ 
ary  1995. 

[31]  J.  W.  Edwards  and  J.  B.  Malone,  Current 
status  of  computational  methods  for  transonic 
unsteady  aerodynamics  and  aeroelastic  applica¬ 
tions,  Comput.  Sys.  Engrg.  3  (1992)  545-569. 

[32]  C.  Farhat,  J.  Mandel  and  F.  X.  Roux,  Opti¬ 
mal  convergence  properties  of  the  FETI  domain 
decomposition  method,  Comput.  Meths.  Appl. 
Mech.  Engrg.  115  (1994)  367-388. 

[33]  C.  Farhat,  P.  S.  Chen  and  J.  Man- 
del,  A  scalable  Lagrange  multiplier  based  do¬ 
main  decomposition  method  for  implicit  time- 
dependent  problems,  Internat.  J.  Numer. 
Meths.  Engrg.,  (in  press). 

[34]  C.  Farhat,  L.  Crivelli  and  F.  X.  Roux,  Ex¬ 
tending  substructure  based  iterative  solvers  to 


multiple  load  and  repeated  analyses,  Comput. 
Meths.  Appl.  Mech.  Engrg.  117  (1994)  195- 
209. 

[35]  N.  Maman  and  C.  Farhat,  Matching  fluid 
and  structure  meshes  for  aeroelastic  computa¬ 
tions:  a  parallel  approach,  Comput.  &  Struc. 
54  (1995)  779-785. 

[36]  M.  Lesoinne  and  C.  Farhat,  Geometric  con¬ 
servation  laws  for  aeroelastic  computations  us¬ 
ing  unstructured  dynamic  meshes,  AIAA  Paper 
95-1709,  12th  AIAA  Computational  Fluid  Dy¬ 
namics  Conference,  San  Diego,  California,  June 
19-22,  1995. 

[37]  P.  D.  Thomas  and  C.  K.  Lombard,  Geo¬ 
metric  conservation  law  and  its  application  to 
flow  computations  on  moving  grids,  AIAA  J.  17 
(1979) 1030-1037. 

[38]  B.  NKonga,  H.  Guillard,  Godunov  type 
method  on  non-structured  meshes  for  three- 
dimensional  moving  boundary  problems,  Com¬ 
put.  Meths.  Appl.  Mech.  Engrg.  113  (1994) 
183-204. 

[39] H. Zhang, M. Reggio, J.Y. Trepanier and
R. Camarero, Discrete form of the GCL for
moving meshes and its implementation in CFD
schemes, Computers & Fluids 22 (1993) 9-23.

[40]  J.  Steger  and  R.  F.  Warming,  Flux  vector 
splitting  for  the  inviscid  gas  dynamic  with  ap¬ 
plications  to  finite-difference  methods,  Journ. 
of  Comp.  Phys.  40  (1981)  263-293. 

[41]  W.K.  Anderson,  J.L.  Thomas  and  C.L. 
Rumsey,  Extension  and  application  of  flux- 
vector  splitting  to  unsteady  calculations  on  dy¬ 
namic  meshes,  AIAA  Paper  No  87-1152-CP, 
1987. 

[42]  L.P.  Franca,  S.L.  Frey  and  T.J.R.  Hughes, 
Stabilized  finite  element  methods:  1.  Applica¬ 
tion  to  the  advective-diffusive  model,  Comput. 
Meths.  Appl.  Mech.  Engrg.  95  (1992)  253- 
276. 

[43]  H.  Schlichting,  Boundary  layer  theory. 
Fourth  edition,  McGraw-Hill,  New  York,  1960. 




[44]  C.  Farhat,  M.  Lesoinne  and  N.  Maman, 
Mixed  explicit/implicit  time  integration  of  cou¬ 
pled  aeroelastic  problems:  three-field  formula¬ 
tion,  geometric  conservation  and  distributed  so¬ 
lution,  Internat.  J.  Numer.  Meths.  Fluids,  (in 
press). 

[45]  C.  Farhat,  M.  Lesoinne,  P.  S.  Chen  and 
S.  Lanteri,  Parallel  heterogeneous  algorithms 
for  the  solution  of  three-dimensional  transient 
coupled  aeroelastic  problems,  AIAA  Paper  95- 
1290,  AIAA  36th  Structural  Dynamics  Meet¬ 
ing,  New  Orleans,  Louisiana,  April  10-13,  1995. 

[46]  C.  Farhat,  S.  Lanteri  and  L.  Fezoui,  Mixed 
finite  volume/finite  element  massively  parallel 
computations:  Euler  flows,  unstructured  grids, 
and  upwind  approximations,  in  Unstructured 
Scientific  Computation  on  Scalable  Multipro¬ 
cessors,  ed.  by  P.  Mehrotra,  J.  Saltz,  and  R. 
Voigt,  MIT  Press  (1992)  253-283. 

[47]  C.  Farhat,  L.  Fezoui,  and  S.  Lanteri,  Two- 
dimensional  viscous  flow  computations  on  the 
Connection  Machine:  unstructured  meshes,  up¬ 
wind  schemes,  and  massively  parallel  computa¬ 
tions,  Comput.  Meths.  Appl.  Mech.  Engrg. 
102  (1993)  61-88. 

[48] S. Lanteri and C. Farhat, Viscous flow computations
on MPP systems: implementational
issues and performance results for unstructured
grids, in Parallel Processing for Scientific Computing,
ed. by R. F. Sincovec et al., SIAM
(1993) 65-70.

[49]  C.  Farhat  and  S.  Lanteri,  Simulation 
of  compressible  viscous  flows  on  a  variety  of 
MPPs:  computational  algorithms  for  unstruc¬ 
tured  dynamic  meshes  and  performance  results, 
Comput.  Meths.  Appl.  Mech.  Engrg.  119 
(1994)  35-60. 

[50] P. L. Roe, Approximate Riemann solvers,
parameter vectors and difference schemes, J.
Comp. Phys. 43 (1981) 357-371.


[51] B. Van Leer, Towards the ultimate conservative
difference scheme V: a second-order sequel
to Godunov’s method, J. Comp. Phys.
32 (1979) 361-370.

[52]  A.  Dervieux,  Steady  Euler  simulations  us¬ 
ing  unstructured  meshes.  Von  Karman  Institute 
Lecture  Series,  1985. 

[53]  L.  Fezoui  and  B.  Stoufflet,  A  class  of  im¬ 
plicit  upwind  schemes  for  Euler  simulations 
with  unstructured  meshes,  J.  Comp.  Phys.  84 
(1989)  174-206. 

[54]  X.-C.  Cai,  C.  Farhat,  and  M.  Sarkis, 
Schwarz  preconditioners  and  implicit  methods 
for  compressible  flows  problems  on  unstruc¬ 
tured  meshes.  Eighth  International  Conference 
on  Domain  Decomposition  Methods  for  Partial 
Differential  Equations,  AMS,  1995,  (in  press). 

[55]  X.-C.  Cai  and  O.B.  Widlund,  Multiplica¬ 
tive  Schwarz  algorithms  for  nonsymmetric  and 
indefinite  elliptic  problems,  SIAM  J.  Numer. 
Anal.  30  (1993)  936-952. 

[56] X.-C. Cai, W. D. Gropp, and D. E. Keyes,
A comparison of some domain decomposition
and ILU preconditioned iterative methods for
nonsymmetric elliptic problems, Numer. Lin.
Alg. Applics. 1 (1994) 477-504.

[57]  C.  Farhat  and  F.  X.  Roux,  Implicit  parallel 
processing  in  structural  mechanics.  Computa¬ 
tional  Mechanics  Advances  2  (1994)  1-124. 

[58]  C.  Farhat  and  E.  Wilson,  A  new  finite  el¬ 
ement  concurrent  computer  program  architec¬ 
ture,  Internat.  J.  Numer.  Meths.  Engrg.  24 
(1987)  1771-1792. 

[59]  P.  E.  Bjordstad  and  O.  B.  Widlund,  Iter¬ 
ative  methods  for  solving  elliptic  problems  on 
regions  partitioned  into  substructures,  SIAM  J. 
of  Num.  Anal.  23  (1986)  1097-1120. 

[60]  C.  Farhat,  A  Lagrange  multiplier  based  di¬ 
vide  and  conquer  finite  element  algorithm,  J. 
Comput.  Sys.  Engrg.  2  (1991)  149-156. 

[61] C. Farhat and F. X. Roux, A method of
finite element tearing and interconnecting and
its parallel solution algorithm, Internat. J. Numer.
Meths. Engrg. 32 (1991) 1205-1227.

[62]  J.  H.  Bramble,  J.  E.  Pasciak,  and  A.  H. 
Schatz,  The  construction  of  preconditioners  for 
elliptic  problems  by  substructuring,  I,  Math. 
Comp.  47  (1986)  103-134. 

[63] I. S. Duff, Parallel implementation of multifrontal
schemes, Parallel Computing 3 (1986)
193-204.

[64]  J.  W.  H.  Liu,  The  multifrontal  method  for 
sparse  matrix  solution:  theory  and  practice, 
SIAM  Review  34  (1992)  82-109. 

[65]  R.  E.  Benner,  G.  R.  Montry  and  G. 
G.  Weigand,  Concurrent  multifrontal  methods: 
shared  memory,  cache,  and  frontwidth  issues, 
Int.  J.  Supercomp.  Appl.  1  (1987)  26-44. 

[66]  M.  Lesoinne,  C.  Farhat  and  M.  Geradin, 
Parallel/vector  improvements  of  the  frontal 
method,  Internat.  J.  Numer.  Meths.  Engrg. 
32  (1991)  1267-1282. 

[67]  A.  Pothen  and  C.  Sun,  A  mapping  algo¬ 
rithm  for  parallel  sparse  matrix  factorization, 
SIAM  J.  Sci.  Comput.  4  (1993)  1253-1257. 

[68] A. Pothen, E. Rothberg, H. Simon and
L. Wang, Parallel sparse Cholesky factorization
with spectral nested dissection ordering, RNR-094-011,
NASA Ames Research Center, May
1994.

[69] M. Dryja and O. B. Widlund, Domain
decomposition algorithms with small overlap,
SIAM J. Sci. Comput. 15 (1994) 604-620.

[70]  J.  Mandel  and  R.  Tezaur,  Convergence  of 
a  substructuring  method  with  Lagrange  multi¬ 
pliers,  Numerische  Mathematik,  (in  press). 

[71]  C.  Farhat,  Optimizing  substructuring  meth¬ 
ods  for  repeated  right  hand  sides,  scalable  par¬ 
allel  coarse  solvers,  and  global/local  analysis, 
in:  D.  Keyes,  Y.  Saad  and  D.  G.  Truhlar, 
eds.,  Domain-Based  Parallelism  and  Problem 
Decomposition  Methods  in  Computational  Sci¬ 
ence  and  Engineering,  SIAM  (1995)  141-160. 


[72] C. Farhat, P. S. Chen and P. Stern,
Towards the ultimate iterative substructuring
method: combined numerical and parallel scalability,
and multiple load cases, J. Comput.
Sys. Engrg. 117 (1994) 195-209.

[73]  C.  Farhat  and  F.X.  Roux,  An  unconven¬ 
tional  domain  decomposition  method  for  an  ef¬ 
ficient  parallel  solution  of  large-scale  finite  ele¬ 
ment  systems,  SIAM  J.  Sc.  Stat.  Comp.  13 
(1992)  379-396. 

[74]  C.  Farhat,  L.  Crivelli  and  F.  X.  Roux, 
A  transient  FETI  methodology  for  large-scale 
parallel  implicit  computations  in  structural  me¬ 
chanics,  Internat.  J.  Numer.  Meths.  Engrg.  37 
(1994)  1945-1975. 

[75]  J.  Mandel  and  C.  Farhat,  The  FETI 
method  for  plate  problems,  Comput.  Meths. 
Appl.  Mech.  Engrg.,  (in  press). 

[76]  C.  Farhat,  L.  Crivelli  and  M.  Geradin,  On 
the  spectral  stability  of  time  integration  algo¬ 
rithms  for  a  class  of  constrained  dynamics  prob¬ 
lems,  AIAA  Paper  93-1306,  AIAA  34th  Struc¬ 
tural  Dynamics  Meeting,  1993. 

[77]  C.  Farhat  and  M.  Geradin,  Using  a  reduced 
number  of  Lagrange  multipliers  for  assembling 
parallel  incomplete  field  finite  element  approxi¬ 
mations,  Comput.  Meths.  Appl.  Mech.  Engrg. 
97  (1992)  333-354. 

[78] C. Farhat and M. Geradin, On a component
mode synthesis method and its application to
incompatible substructures, Comput. & Struc.
51 (1994) 459-473.

[79]  L.  Petzold,  Differential/algebraic  equations 
are  not  ODE’s,  SIAM  J.  Sci.  Stat.  Comput.  3 
(1982)  367-384. 

[80] Y. Saad, On the Lanczos method for solving
symmetric linear systems with several right-hand
sides, Math. Comp. 48 (1987) 651-662.

[81] C. Farhat and P. S. Chen, Tailoring domain
decomposition methods for efficient parallel
coarse grid solution and for systems with
many right hand sides, Contemporary Mathematics
180 (1994) 401-406.




[82]  C.  Farhat,  S.  Lanteri  and  H.  D.  Simon, 
TOP/DOMDEC,  A  software  tool  for  mesh  par¬ 
titioning  and  parallel  processing,  J.  Comput. 
Sys.  Engrg.  (in  press). 

[83]  H.  D.  Simon,  Partitioning  of  unstructured 
problems  for  parallel  processing,  Comput.  Sys. 
Engrg.  2  (1991)  135-148. 

[84] C. Farhat, A simple and efficient automatic
FEM domain decomposer, Comput. & Struct.
28 (1988) 579-602.

[85]  C.  Farhat  and  M.  Lesoinne,  Automatic  par¬ 
titioning  of  unstructured  meshes  for  the  parallel 
solution  of  problems  in  computational  mechan¬ 
ics,  Internat.  J.  Numer.  Meths.  Engrg.  36 
(1993)  745-764. 

[86]  J.  G.  Malone,  Automated  mesh  decom¬ 
position  and  concurrent  finite  element  analy¬ 
sis  for  hypercube  multiprocessors  computers, 
Comput.  Meths.  Appl.  Mech.  Engrg.  70 
(1988)  27-58. 

[87] J. Flower, S. Otto and M. Salama, Optimal
mapping of irregular finite element domains to
parallel processors, in: A. K. Noor, ed., Parallel
Computations and Their Impact on Mechanics,
The American Society of Mechanical Engineers,
AMD-Vol. 86 (1987) 239-252.

[88] A. Pothen, H. Simon and K. P. Liou, Partitioning
sparse matrices with eigenvectors of
graphs, SIAM J. Mat. Anal. Appl. 11 (1990)
430-452.

[89]  C.  Farhat,  E.  Wilson  and  G.  Powell,  So¬ 
lution  of  finite  element  systems  on  concurrent 
processing  computers,  Engrg.  with  Comput.  2 
(1987)  157-165. 

[90]  A.  I.  Khan  and  B.  H.  V.  Topping,  Subdo¬ 
main  generation  for  parallel  finite  element  anal¬ 
ysis,  Comput.  Sys.  Engrg.  4  (1993)  473-488. 

[91]  P.  Ciarlet  and  F.  Lamour,  An  efficient  low- 
cost  greedy  graph  partitioning  heuristic,  UCLA 
CAM  Report  94-1. 

[92] P. Ciarlet and F. Lamour, Recursive partitioning
methods and greedy partitioning methods:
a comparison on finite element graphs,
UCLA CAM Report 94-9 (also submitted to Internat.
J. High Speed Computing).

[93]  S.  T.  Barnard  and  H.  D.  Simon,  A  fast  mul¬ 
tilevel  implementation  of  recursive  spectral  bi¬ 
section  for  partitioning  unstructured  problems. 
Concurrency:  Practice  and  Experience  6  (1994) 
101-107. 

[94] J. Zdenek, K. K. Mathur, S. L. Johnsson
and T. J. R. Hughes, An efficient communication
strategy for finite element methods on
the Connection Machine CM-5 system, Comput.
Meths. Appl. Mech. Engrg. 113 (1994)
363-387.

[95]  O.  Zone,  D.  Vanderstraeten,  P.  Henriksen, 
and  R.  Keunings,  A  parallel  direct  solver  for 
implicit  finite  element  problems  based  on  auto¬ 
matic  domain  decomposition,  Proc.  Int.  Conf. 
on  Massively  Parallel  Processing  Applications 
and  Development,  L.  Dekker  (Ed.),  Elsevier,  (in 
press). 

[96] C. Farhat, N. Maman and G. Brown, Mesh
partitioning for implicit computations via iterative
domain decomposition: impact and optimization
of the subdomain aspect ratio, Internat.
J. Numer. Meths. Engrg. 38 (1995)
989-1000.

[97] D. Vanderstraeten, R. Keunings and C.
Farhat, Optimization of mesh partitions and
impact on parallel CFD, Proceedings Parallel
CFD’93, Paris, France, May 10-12 (1993).

[98]  D.  Vanderstraeten  and  R.  Keunings,  Op¬ 
timized  partitioning  of  unstructured  computa¬ 
tional  grids,  Internat.  J.  Numer.  Meths.  En¬ 
grg.  38  (1995)  433-450. 

[99] D. Vanderstraeten, C. Farhat, P. S. Chen,
R. Keunings, and O. Zone, A retrofit and contraction
based methodology for the fast generation
and optimization of mesh partitions:
beyond the minimum interface size criterion,
Comput. Meths. Appl. Mech. Engrg., (in
press).

[100]  S.  Kirkpatrick,  C.  Gelatt  and  M.  Vecchi, 
Optimization  by  simulated  annealing.  Science 
220  (1983)  671-680. 




[101] F. Glover, C. McMillan and B. Novick, Interactive
decision software and computer graphics
for architectural and space planning, Ann.
Opns. Res. 5 (1985) 557-573.

[102]  Y.  G.  Saab  and  V.  B.  Rao,  Combinato¬ 
rial  optimization  by  stochastic  evolution,  IEEE 
Trans.  C.A.D.  10  (1991)  525-535. 


[103]  T.  N.  Bui  and  C.  Jones,  A  heuristic  for 
reducing  fill-in  in  sparse  matrix  factorization. 
Proceedings  of  the  Sixth  SIAM  Conference  on 
Parallel  Processing  for  Scientific  Computing, 
Norfolk,  Virginia,  (1993)  445-452. 

[104]  E.  Barszcz,  Intercube  communication 
on  the  iPSC/860,  Scalable  High  Performance 
Computing  Conference,  Williamsburg,  April 
26-29,  1992. 


REPORT DOCUMENTATION PAGE

1. Recipient’s Reference:
2. Originator’s Reference: AGARD-R-807
3. Further Reference: ISBN 92-836-1025-3
4. Security Classification of Document: UNCLASSIFIED/UNLIMITED
5. Originator: Advisory Group for Aerospace Research and Development
   North Atlantic Treaty Organization
   7 rue Ancelle, 92200 Neuilly-sur-Seine, France
6. Title: Parallel Computing in CFD
7. Presented at/sponsored by: AGARD-FDP-VKI Special Course at the VKI, Rhode-Saint-Genese, Belgium,
   15-19 May 1995 and 16-20 October 1995 at NASA Ames, United States.
8. Author(s)/Editor(s): Multiple
9. Date: October 1995
10. Author’s/Editor’s Address: Multiple
11. Pages: 352
12. Distribution Statement: There are no restrictions on the distribution of this document.
    Information about the availability of this and other AGARD
    unclassified publications is given on the back cover.
13. Keywords/Descriptors:
    Fluid dynamics
    Computational fluid dynamics
    Algorithms
    Differential equations
    Fluid flow
    Parallel computing
    Compressible flow
    Incompressible flow
    Domain decomposition algorithms
    Partitioning techniques

14. Abstract


Lecture  notes  for  the  AGARD  Fluid  Dynamics  Panel  (FDP)  Special  Course  on  “Parallel 
Computing  in  CFD”  have  been  assembled  in  this  report.  The  aim  and  scope  of  this  Course  was 
to  present  and  discuss  the  latest  in  advances  and  future  trends  in  the  application  of  parallel 
computing  to  solve  computationally  intensive  problems  in  CFD.  Topics  in  this  lecture  series 
focus  on  the  increasingly  sophisticated  types  of  architectures  now  available,  and  how  to  exploit 
these  architectures  by  appropriate  algorithms  for  the  simulation  of  fluid  flow.  Some  of  the 
subjects  discussed  are:  parallel  algorithms  for  computing  compressible  and  incompressible  flow; 
domain  decomposition  algorithms  and  partitioning  techniques;  and  parallel  algorithms  for  solving 
linear  systems  arising  from  the  discretized  partial  differential  equations. 

The  material  assembled  in  this  report  was  prepared  under  the  combined  sponsorship  of  the 
AGARD  Fluid  Dynamics  Panel,  the  Consultant  and  Exchange  Program  of  AGARD,  and  the  von 
Karman  Institute  (VKI)  for  Fluid  Dynamics. 



NATO  OTAN 

7  RUE  ANCELLE  •  92200  NEUILLY-SUR-SEINE 
FRANCE 


DISTRIBUTION  OF  UNCLASSIFIED 
AGARD  PUBLICATIONS 


Telefax (1)47.38.57.99 • Telex 610 176

AGARD  holds  limited  quantities  of  the  publications  that  accompanied  Lecture  Series  and  Special  Courses  held  in  1993  or  later,  and  of 
AGARDographs  and  Working  Group  reports  published  from  1993  onward.  For  details,  write  or  send  a  telefax  to  the  address  given  above. 
Please do not telephone.

AGARD  does  not  hold  stocks  of  publications  that  accompanied  earlier  Lecture  Series  or  Courses  or  of  any  other  publications.  Initial 
distribution  of  all  AGARD  publications  is  made  to  NATO  nations  through  the  National  Distribution  Centres  listed  below.  Further  copies  are 
sometimes  available  from  these  centres  (except  in  the  United  States).  If  you  have  a  need  to  receive  all  AGARD  publications,  or  just  those 
relating  to  one  or  more  specific  AGARD  Panels,  they  may  be  willing  to  include  you  (or  your  organisation)  on  their  distribution  list. 
AGARD  publications  may  be  purchased  from  the  Sales  Agencies  listed  below,  in  photocopy  or  microfiche  form. 


NATIONAL  DISTRIBUTION  CENTRES 


BELGIUM 

Coordonnateur  AGARD  —  VSL 
Etat-major  de  la  Force  aerienne 
Quartier  Reine  Elisabeth 
Rue  d’Evere,  1140  Bruxelles 
CANADA 

Director  Scientific  Information  Services 
Dept  of  National  Defence 
Ottawa, Ontario K1A 0K2
DENMARK 

Danish  Defence  Research  Establishment 
Ryvangs  Alle  1 
P.O.  Box  2715 
DK-2100 Copenhagen Ø
FRANCE 

O.N.E.R.A.  (Direction) 

29  Avenue  de  la  Division  Leclerc 
92322  Chatillon  Cedex 


LUXEMBOURG 
See  Belgium 

NETHERLANDS 

Netherlands  Delegation  to  AGARD 
National  Aerospace  Laboratory,  NLR 
P.O.  Box  90502 
1006  BM  Amsterdam 

NORWAY 

Norwegian  Defence  Research  Establishment 
Attn: Biblioteket
P.O.  Box  25 
N-2007  Kjeller 

PORTUGAL 

Força Aérea Portuguesa

Centro de Documentação e Informação

Alfragide 

2700  Amadora 


GERMANY 

Fachinformationszentrum 

Karlsruhe 

D-76344  Eggenstein-Leopoldshafen  2 
GREECE 

Hellenic  Air  Force 
Air  War  College 
Scientific  and  Technical  Library 
Dekelia  Air  Force  Base 
Dekelia,  Athens  TGA  1010 
ICELAND 

Director  of  Aviation 
c/o  Flugrad 
Reykjavik 
ITALY 

Aeronautica Militare

Ufficio del Delegato Nazionale all’AGARD
Aeroporto  Pratica  di  Mare 
00040  Pomezia  (Roma) 


SPAIN 

INTA  (AGARD  Publications) 

Pintor  Rosales  34 
28008  Madrid 

TURKEY 

Millî Savunma Başkanlığı (MSB)
ARGE Dairesi Başkanlığı (MSB)
06650  Bakanliklar-Ankara 

UNITED  KINGDOM 

Defence  Research  Information  Centre 
Kentigern House
65 Brown Street
Glasgow G2 8EX

UNITED  STATES 

NASA  Headquarters 
Code  JOB-1 

Washington,  D.C.  20546 


The  United  States  National  Distribution  Centre  does  NOT  hold  stocks  of  AGARD  publications. 

Applications  for  copies  should  be  made  direct  to  the  NASA  Center  for  AeroSpace  Information  (CASI)  at  the  address  below. 

Change  of  address  requests  should  also  go  to  CASI. 


SALES  AGENCIES 


NASA Center for AeroSpace Information (CASI)
800 Elkridge Landing Road
Linthicum Heights, MD 21090-2934
United States

ESA/Information Retrieval Service
European Space Agency
10, rue Mario Nikis
75015 Paris
France

The British Library
Document Supply Centre
Boston Spa, Wetherby
West Yorkshire LS23 7BQ
United Kingdom


Requests  for  microfiches  or  photocopies  of  AGARD  documents  (including  requests  to  CASI)  should  include  the  word  ‘AGARD’ 
and  the  AGARD  serial  number  (for  example  AGARD-AG-315).  Collateral  information  such  as  title  and  publication  date  is 
desirable.  Note  that  AGARD  Reports  and  Advisory  Reports  should  be  specified  as  AGARD-R-nnn  and  AGARD-AR-nnn, 
respectively.  Full  bibliographical  references  and  abstracts  of  AGARD  publications  are  given  in  the  following  journals: 


Scientific  and  Technical  Aerospace  Reports  (STAR) 
published  by  NASA  Scientific  and  Technical 
Information  Division 
NASA  Headquarters  (ITT) 

Washington  D.C.  20546 
United  States 


Government  Reports  Announcements  and  Index  (GRA&I) 

published  by  the  National  Technical  Information  Service 

Springfield 

Virginia  22161 

United  States 

(also  available  online  in  the  NTIS  Bibliographic 
Database  or  on  CD-ROM) 


Printed  by  Canada  Communication  Group 
45 Sacré-Coeur Blvd., Hull (Québec), Canada K1A 0S7




